MSN

Archived posts from the 'MSN' Category

WTF have Google, Bing, and Yahoo cooking?

Posted on 17 September, 2010

Folks, I’ve got good news. As a matter of fact, they’re so good that they will revolutionize SEO. A little bird told me, that the major search engines secretly teamed up to solve the problem of context and meaning as a ranking factor.

They’ve invented a new Web standard that allows content producers to steer search engine ranking algos. Its code name is ADL, probably standing for Aided Derivative Latch, a smart technology based on the groundwork of addressing tidbits of information developed by Hollerith and Neumann decades ago.

According to my sources, ADL will be launched next month at SMX East in New York City. In order to get you guys primed in a timely manner, here I’m going to leak the specs:

WTF - The official SEO standard, supported by Google, Yahoo & Bing

Word Targeting Funnel (WTF) is a set of indexer directives that get applied to Web resources as meta data. WTF comes with a few subsets for special use cases, details below. Here’s an example:

<meta name="WTF" content="document context" href="http://google.com/search?q=WTF" />

This directive tells search engines, that the content of the page is closely related to the resource supplied in the META element’s HREF attribute.

As you’ve certainly noticed, you can target a specific SERP, too. That’s somewhat complicated, because the engineers couldn’t agree which search engine should define a document’s search query context. Fortunately, they finally found this compromise:

<meta name="WTF" content="document context" href="http://google.com/search?q=WTF || http://www.bing.com/search?q=WTF || http://search.yahoo.com/search?q=WTF" />

As far as I know, this will even work if you change the order of URIs. That is, if you’re a Bing fanboy, you can mention Bing before Google and Yahoo.

A more practical example, taken from an affiliate’s sales pitch for viagra that participated in the BETA test, leads us to the first subset:

Subset WTFm — Word Targeting Funnel for medical terms

<meta name="WTF" content="document context" href="http://www.pfizer.com/files/products/uspi_viagra.pdf" />

This directive will convince search engines that the offered product indeed is not a clone like Cialis.

Subset WTFa — Word Targeting Funnel for acronyms

<meta name="WTFa" content="WTF" href="http://www.wtf.org/" />

When a Web resource contains the acronym “WTF”, search engines will link it to the World Taekwondo Federation, not to Your Ranting and Debating Resource at www.wtf.com.

Subset WTFo — Word Targeting Funnel for offensive language

<meta name="WTFo" content="meaning of terms" href="http://www.noslang.com/" />

If a search engine doesn’t know the meaning of terms I really can’t quote here, it will lookup the Internet Slang Directory. You can define alternatives, though:

<meta name="WTFo" content="alternate meaning of terms" href="http://dictionary.babylon.com/language/slang/low-life-glossary/" />

WTF, even more?

Of course we’ve got more subsets, like WTFi for instant searches. Because I appreciate unfair advantages, I won’t reveal more. Just one more goody: it works for PDF, Flash content and heavily ajax’ed stuff, too.

This is the very first newish indexer directive that search engines introduce with support for both META elements and HTTP headers as well. Like with the X-Robots-Tag, you can use an X-WTF-Tag HTTP header:
X-WTF-Tag: Name: WTFb, Content: SEO Bullshit, Href: http://seobullshit.com/

As for the little bird, well, that’s a lie. Sorry. There’s no such bird. It’s bugs I left last time I visited Google’s labs:
<meta name="WTF" content="bug,bugs,bird,birds" href="http://www.spylife.com/keysnoop.html" />

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

28 comments Sebastian | MSN, Search Quality, SEO, Yahoo, Google

Get yourself a smart robots.txt

Posted on 25 February, 2010

greedy and aggressive web robots steal your content Crawlers and other Web robots are the plague of today’s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don’t. Behaving or not, most bots just steal your content. You don’t appreciate that, so block them.

This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines -that bring you nice traffic- while keeping the nasty (or useless, traffic-wise) bots out of the game.

The basic idea is that blocking all bots -with very few exceptions- makes more sense than maintaining kinda Web robots who’s who in your robots.txt file. You decide whether a bot, respectively the service it crawls for, does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your white list. All the remaining bots will run into a bold Disallow: /.

Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.

Even bots that respect the REP can harm your search engine rankings, or reveal sensitive information to your competitors. Every minute a new bots turns up. You can’t manage all of them, and you can’t trust any (behaving) bot. Or, as the master of bot control explains: “That’s the only thing I’m concerned with: what do I get in return. If it’s nothing, it’s blocked“.

Also, large robots.txt files handling tons of bots are fault prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If you on the other hand verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don’t enlighten your competitors.

If you’re a smart webmaster agreeing with this approach, here’s your ToDo-List:
• Grab the code
• Install
• Customize
• Test
• Implement.
On error read further.

The anatomy of a smart robots.txt

Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.

Install

First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:<FilesMatch ^robots\.txt$> SetHandler application/x-httpd-php </FilesMatch>
As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.

Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!

For testing purposes you can use the logRequest() function. Probably it’s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don’t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.

Customize

Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write a isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, respectively /robots.txt once you’ve launched your smart robots.txt.

Let’s look at /smart_robots.txt. First of all, it sets the canonical server name, change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!) it includes /smart_robots_inc.php.

Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served`only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.

As a side note: if you want to know what your competitor really shoves into their robots.txt, then just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can login to GWC and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.

When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.

Next up are the search engine specific crawler directives. You put them as if (isGooglebot()) { $content .= " User-agent: Googlebot Disallow: … \n\n"; }
If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)

Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output $content .= "User-agent: * Disallow: / \n\n";
Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.

Before $content is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte -for example leading whitespaces, a debug message, or a #comment- before the print $content; statement.

Blocking rogue bots is important. If you discover a rogue bot -for example a scraper that pretends to be Googlebot- during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!

Bear in mind that each and every piece of content served from your site should implement rogue bot detection, that’s doable even with non-HTML resources like images or PDFs.

Finally we deliver the user agent specific robots.txt and terminate the connection.

Now let’s look at /smart_robots_inc.php. Don’t fuck-up the variable definitions and routines that populate them or deal with the requestor’s IP addy.

Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).

/smart_robots_inc.php comes with functions that detect Googlebot, MSNbot, and Slurp.

Most search engines tell how you can verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example to add Yandex, test the host name for a leading “spider” and trailing “.yandex.ru” string and inbetween an integer, like in the isSlurp() function.

Test

Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits, you can use the “Fetch as Googlebot” thingy in your webmaster console.

Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “.search.live.com” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “inktomisearch.com” host name.

Don’t rely on my crawler detection routines. They’re dumped from memory in a hurry, I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.

Launch

Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.

The output of a smart robots.txt

When you download a smart robots.txt with your browser, wget, or any other tool that comes with user agent spoofing, you’ll see a 403 or something like:

HTTP/1.1 200 OK Date: Wed, 24 Feb 2010 16:14:50 GMT Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1 X-Powered-By: sebastians-pamphlets.com X-Robots-Tag: noindex, noarchive, nosnippet Connection: close Transfer-Encoding: chunked Content-Type: text/plain;charset=iso-8859-1
# In order to gain permissions to crawl blocked site areas # please contact the webmaster via # http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot
User-agent: * Disallow: /(the contact form URI above doesn’t exist)

whilst a real search engine crawler like Googlebot gets slightly different contents:

HTTP/1.1 200 OK Date: Wed, 24 Feb 2010 16:14:50 GMT Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1 X-Powered-By: sebastians-pamphlets.com X-Robots-Tag: noindex, noarchive, nosnippet Connection: close Transfer-Encoding: chunked Content-Type: text/plain; charset=iso-8859-1
# In order to gain permissions to crawl blocked site areas # please contact the webmaster via # http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot User-agent: Googlebot Allow: / Disallow: Sitemap: http://sebastians-pamphlets.com/sitemap.xml
User-agent: * Disallow: /

Search engines hide important information from webmasters

Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked Google doesn’t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection webmasters do need such information.

I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the forseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even announce properly that they’ve changed their crawler host names a while ago. Very sad. A search engine changing their crawler host names breaks code on many websites.

Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.

Enjoy.

Changelog:

2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

22 comments Sebastian | Crawler Directives, Web development, X-Robots-Tag, MSN, Cloaking, .htaccess, Yahoo, SEO, robots.txt, Webmaster Central, Google

URI canonicalization with an X-Canonical-URI HTTP header

Posted on 22 December, 2009

X-Canonical-URI HTTO Header Dear search engines, you owe me one for persistently nagging you on your bugs, flaws and faults. In other words, I’m desperately in need of a good reason to praise your wisdom and whatnot. From this year’s x-mas wish list:

All search engines obey the X-Canonical-URI HTTP header

The rel=canonical link element is a great tool, at least if applied properly, but sometimes it’s a royal pain in the ass.

Inserting rel=canonical link elements into huge conglomerates of cluttered scripts and static files is a nightmare. Sometimes the scripts creating the most URI clutter are compiled, and there’s no way to get a hand on the source code to change them.

Also, lots of resources can’t be stuffed with HTML’s link elements, for example dynamically created PDFs, plain text files, or images.

It’s not always possible to revamp old scripts, some projects just lack a suitable budget. And in some cases 301 redirects aren’t a doable option, for example when the destination URI is #5 in a redirect chain that can’t get shortened because the redirects are performed by a 3rd party that doesn’t cooperate.

This one, on the other hand, is elegant and scalable:

if (messedUp($_SERVER["REQUEST_URI"])) {
header(”X-Canonical-URI: $canonicalUri”);
}

Or:
header(”Link: <http://example.com/canonical-uri/>; rel=canonical”);

Coding an HTTP request handler that takes care of URI canonicalization before any script gets invoked, and before any static file gets served, is the way to go for such fuddy-duddy sites.

By the way, having all URI canonicalization routines in one piece of code is way more transparent, and way better manageable, than a bazillion of isolated link elements spread over tons of resources. So that might be a feasible procedure for non-ancient sites, too.

red crab blackmailing search engines Dear search engines, if you make that happen, I promise that I don’t tweet your products with a “#crap” hashtag for the whole rest of this year. Deal?

And yes, I know I’m somewhat late, two days before x-mas, but you’ve got smart developers, haven’t you? So please, go get your ‘code monkeys’ to work and surprise me. Thanks.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

11 comments Sebastian | Web development, MSN, Crawler Directives, SEO, Yahoo, Google

As if sloppy social media users ain’t bad enough … search engines support traffic theft

Posted on 26 October, 2009

Prepare for a dose of techy tin foil hattery. [Skip rant] Again, I’m going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it’s URI, not URL).

Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential ~~visitors~~ customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of tinyurl.com’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There’re still way too much playgrounds for deceptive tactics involving shortened URIs …

How should a search engine handle a shortened URI?

Handling an URI as shortened URL requires a bullet proof method to detect shortened URIs. That’s a breeze.

Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads on the top frame, and I’ve seen even a few of them making use of meta refreshs and client sided redirects. Search engines can detect all those procedures.
Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
Seed: trusted sources like LongURL.org provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like su.pr …).
Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
URI length: the length of an URI string, if less or equal 20 characters, is an indicator at most, because some URI shortening services offer keyword rich short URIs, and many sites provide natural URIs this short.

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless the method of redirection. All the link juice belongs to the destination page.

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Bing 301 messed up SERPs Oh what a mess. The candidate from Redmond fails totally on understanding the HTTP protocol. Their search index is flooded with a bazillion of URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. Also, you’ll find URIs that do a permanent redirect and have nothing to do with URI shortening in their index, too.

I can’t be bothered with checking what Bing does in response to other redirects, since the 301 test fails so badly. Clicking on their first results for [site:tinyurl.com], I’ve noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI-SERPs, please learn more under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo has done a better job. They index short URIs and show the destination page, at least via their site explorer. When I search for a tinyURL, the SERP link points to the URI shortener, that could get improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short-URIs totally right (I will not elaborate on this issue, so please don’t ask for detailled information if you’re not a SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvements

Google fails with merits As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.

As for the somewhat evil tactis I’ve applied in my tests so far, Google fell in love with some abusive short-URIs. Google -under particular circumstances- indexes shortened URIs that game Googlebot, having sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I’ve begun to deploy sneaky sURLs, Google greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

Long time ago Tim Berners-Lee told you that ~~URI shorteners are evil~~ fucking with URIs is a very bad habit. Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped into a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

fight evil URI shorteners Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us getting rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive the livelihood of these utterly useless scumbags.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. — and that addresses all services that make use of or transport shortened URIs — put and end to shortened URIs. Now!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

6 comments Sebastian | Social Web, MSN, URI shortening, Search Quality, Spam, Yahoo, Risky Linkage, Twitter, Google

Full disclosure @ FTC

Posted on 12 October, 2009

Trying to avoid an $11,000 fine in the Federal Trade Commission’s war on bloggers:

When I ~~write about~~ praise search engines, that’s totally paid-for because I’ve received free search results upfront.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

1 comment Sebastian | MSN, Technorati, Blogging, Twitter, Yahoo, Crap, Google

Search engines should make shortened URIs somewhat persistent

Posted on 7 October, 2009

URI shorteners are crap. Each and every shortened URI expresses a design flaw. All -or at least most- public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to “put traffic at risk”. Not to speak of link love (PageRank, Google juice, link popularity) lost forever.

SEs could rescue tiny URLs Search engines could provide a way out of the sURL dilemma that Twitter & Co created with their crappy, thoughtless and shortsighted software designs. Here’s how:

Most browsers support search queries in the address bar, as well as suggestions (aka search results) on DNS errors, and sometimes even 404s or other HTTP response codes other than 200/3x. That means browsers “ask a search engine” when an HTTP request fails.

When a TLD is out of service, search engines could have crawled a 301 or meta refresh from a page formerly living on a .yu domain for example. They know the new address and can lead the user to this (working) URI.

The same goes for shortened URIs created ages ago by URI shortening services that died in the meantime. Search engines have transferred all the link juice from the shortened URI to the destination page already, so why not point users that request a dead short URI to the right destination?

Search engines have all the data required for rescuing short URIs that are out of service in their datebases. Not de-indexing “outdated” URIs belonging to URI shorteners would be a minor tweak. At least Google has stored attributes and behavior of all links on the Web since the past century, and most probably other search engines are operated by data rats too.

URI shorteners can be identified by simple patterns. They gather tons of inbound links from foreign domains that get redirected (not always using a 301!) to URIs on other 3rd party domains. Of course that applies to some AdServers too, but rest assured search engines do know the differences.

So why the heck didn’t Google, ~~Yahoo/MSN~~ Bing, and Ask offer such a service yet? I thought it’s all about users, but I might have misread something. Sigh.

By the way, I’ve recorded search engine misbehavior with regard to shortened URIs that could arouse Jack The Ripper, but that’s a completely other story.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

10 comments Sebastian | Social Web, Redirects, MSN, URI shortening, Twitter, Risky Linkage, Yahoo, SEO, Crap, Google

Save bandwidth costs: Dynamic pages can support If-Modified-Since too

Posted on 19 February, 2008

Conditional HTTP GET requests make Webmasters and Crawlers happy When search engine crawlers burn way too much of your bandwidth, this post is for you. Crawlers sent out by major search engines (Google, Yahoo and MSN/Live Search) support conditional GETs, that means they don’t fetch your pages if those didn’t change since the last crawl.

Of course they must fetch your stuff over and over again for this comparision, if your Web server doesn’t play nice with Web robots, as well as with other user agents that can cache your pages and other Web objects like images. The protocol your Web server and the requestors use to handle caching is quite simple, but its implementation can become tricky. Here is how it works:

1st request Feb/10/2008 12:00:00

Googlebot requests /some-page.php from your server. Since Google has just discovered your page, there are no unusual request headers, just a plain GET.

You create the page from a database record which was modified on Feb/09/2008 10:00:00. Your server sends Googlebot the full page (5k) with an HTTP header Date: Sun, 10 Feb 2008 12:00:00 GMT Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT
(lets assume your server is located in Greenwich, UK), the HTTP response code is 200 (OK).

Bandwidth used: 5 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

2nd request Feb/17/2008 12:00:00

Googlebot found interesting links pointing to your page, so it requests /some-page.php again to check for updates. Since Google already knows the resource, Googlebot requests it with an additional HTTP header If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT
where the date and time is taken from the Last-Modified header you’ve sent in your response to the previous request.

You didn’t change the page’s record in the database, hence there’s no need to send the full page again. Your Web server sends Googlebot just an HTTP header Date: Sun, 17 Feb 2008 12:00:00 GMT Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT
The HTTP response code is 304 (Not Modified). (Your Web server can suppress the Last-Modified header, because the requestor has this timestamp already.)

Bandwidth used: Less than 500 bytes for the HTTP header.

3rd request Feb/24/2008 12:00:00

Googlebot can’t resist to recrawl /some-page.php, again using the If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT
header.

You’ve updated the database on Feb/23/2008 09:00:00 adding a few paragraphs to the article, thus you send Googlebot the full page (now 7k) with this HTTP header Date: Sun, 10 Feb 2008 12:00:00 GMT Last-Modified: Sat, 23 Feb 2008 09:00:00 GMT
and an HTTP response code 200 (OK).

Bandwidth used: 7 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

Further requests

Provided you don’t change the contents again, all further chats between Googlebot and your Web server regarding /some-page.php will burn less than 500 bytes of your bandwidth each. Say Googlebot requests this page weekly, that’s 370k saved bandwidth annually. You do the math. Even with a medium-sized Web site you most likely want to implement proper caching, right?

Not only Webmasters love conditional GET requests that save bandwidth costs and processing time, search engines aren’t keen on useless data transfers too. So lets see how you could respond efficiently to conditional GET requests from search engines. Apache handles caching of static files (e.g. .txt or .html files you upload with FTP) differently from dynamic contents (script outputs with or without a query string in the URI).

Static files

Fortunately, Apache comes with native support of the Last-Modified / If-Modified-Since / Not-Modified functionality. That means that crawlers and your Web server don’t produce too much network traffic when a requested static file didn’t change since the last crawl.

You can test your Web server’s conditional GET support with your robots.txt, or, if even your robots.txt is a script, create a tiny HTML page with a text editor and upload it via FTP. Another neat tool to check HTTP headers is the Live Headers Extension for FireFox (bear in mind that testing crawler behavior with Web browsers is fault-prone by design).

If your second request of an unchanged static file results in a 200 HTTP response code, instead of a 304, call your hosting service. If it works and you’ve only static pages, then bookmark this article and move on.

Dynamic contents

Everything you output with server sided scripts is dynamic content by definition, regardless whether the URI has a query string or not. Even if you just read and print out a static file -that never changes- with PHP, Apache doesn’t add the Last-Modified header which forces crawlers to perform further requests with an If-Modified-Since header.

With dynamic content you can’t rely on Apache’s caching support, you must do it yourself.

The first step is figuring out where your CMS or eCommerce software hides the timestamps telling you the date and time of a page’s last modification. Usually a script pulls its stuff from different database tables, hence a page contains more than one area, or block, of dynamic contents. Every block might have a different last-modified timestamp, but not every block is important enough to serve as the page’s determinant last-modified date. The same goes for templates. Most template tweaks shouldn’t trigger a full blown recrawl, but some do, for example a new address or phone number if such information is present on every page.

For example a blog has posts, pages, comments, categories and other data sources that can change the sidebar’s contents quite frequently. On a page that outputs a single post or page, the last-modified date is determined by the post, respectively its last comment. The main page’s last-modified date is the modified-timestamp of the most recent post, and the same goes for its paginated continuations. A category page’s last-modified date is determined by the category’s most recent post, and so on.

New posts can change outgoing links of older posts when you use plugins that list related posts and stuff like that. There are many more reasons why search engines should crawl older posts at least monthly or so. You might need a routine that changes a blog page’s last-modified timestamp for example when it is a date more than 30 days or so in the past. Also, in some cases it could make sense to have a routine that can reset all timestamps reported as last-modified date for particular site areas, or even the whole site.

If your software doesn’t populate last-modified attributes on changes of all entities, then snap at the chance to consider database triggers, stored procedures, respectively changes of your data access layer. Bear in mind that not all changes of a record must trigger a crawler cache reset. For example a table storing textual contents like articles or product descriptions usually has a number of attributes that don’t affect crawling, thus it should have an attribute last updated that’s changeable in the UI and serves as last-modified date in your crawler cache control (instead of the timestamp that’s changed automatically even on minor updates of attributes which are meaningless for HTML outputs).

Handling Last-Modified, If-Modified-Since, and Not-Modified HTTP headers with PHP/Apache

Below I provide example PHP code I’ve thrown together after midnight in a sleepless night, doped with painkillers. It doesn’t run on a production system, but it should get you started. Adapt it to your needs and make sure you test your stuff intensively. As always, my stuff comes as is without any guarantees.

First grab a couple helpers and put them in an include file you’ve available in all scripts. Since we deal with HTTP headers, you must not output anything before the logic that deals with conditional search engine requests, not even a single white space character, HTML DOCTYPE declaration …
View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)

<?php function unixTimestamp2HttpDate ($timestamp) { // converts a Unix timestamp to an HTTP date $httpDate = @gmdate("D, d M Y H:i:s", $timestamp) ." GMT"; return $httpDate; // $httpDate === FALSE on invalid input } // end function unixTimestamp2HttpDate function unixTimestamp2MySqlDatetime ($timestamp) { // converts a Unix timestamp to a MySQL datetime $mySqlDatetime = @date("Y-m-d H:i:s", $timestamp); return $mySqlDatetime; } // end function unixTimestamp2MySqlDatetime function date2UnixTimestamp ($date) { // converts any US English date format to a Unix timestamp $timestamp = @strtotime($date); if ($timestamp == -1 || $timestamp === FALSE) { return FALSE; } return $timestamp; } // end function date2UnixTimestamp function makeLastModifiedTimestamp ($timestamp) { // returns a quite safe last-modified-timestamp or the current time $hour = 60 * 60; // consider unsynchronized clocks, or // local time given as GMT and crap like that: $maxDeviant = 13 * $hour; $now = @strtotime("now"); $returnTs = $timestamp + $maxDeviant; if (intval($returnTs) > intval($now)) { return ($now - 5); } return $returnTs; } // end function makeLastModifiedTimestamp function getHttpRequestHeaders () { // returns all HTTP request headers or FALSE $headersArr = array(); if (function_exists("apache_request_headers")) { $headersArr = apache_request_headers(); } else { if (function_exists("getallheaders")) { $headersArr = getallheaders(); } } if (count($headersArr) > 0) { return $headersArr; } return FALSE; } // end function getHttpRequestHeaders function getIfModifiedSince () { // returns If-Modified-Since as Unix Timestamp or FALSE GLOBAL $_SERVER; $headersArr = getHttpRequestHeaders(); if ($headersArr !== FALSE && count($headersArr) > 0) { foreach ($headersArr as $header => $value) { if (stristr($header, "If-Modified-Since")) { $ifModSinceDate = explode(";", $value); $ifModifiedSinceDate = $ifModSinceDate[0]; } } } if (!isset($ifModifiedSinceDate) && isset($_SERVER["IF-MODIFIED-SINCE"]) && !empty($_SERVER["IF-MODIFIED-SINCE"]) ) { $ifModifiedSinceDate = $_SERVER["IF-MODIFIED-SINCE"]; } if (!isset($ifModifiedSinceDate)) { return FALSE; } return date2UnixTimestamp($ifModifiedSinceDate); } // end function getIfModifiedSince () ?>

In general, all user agents should support conditional GET requests, not only search engine crawlers. If you allow long lasting caching, which is fine with search engines that don’t need to crawl your latest Twitter message from your blog’s sidebar, you could leave your visitors with somewhat outdated pages if you serve them 304-Not-Modified responses too.

It might be a good idea to limit 304 responses to conditional GET requests from crawlers, when you don’t implement way shorter caching cycles for other user agents. The latter includes folks that spoof their user agent name as well as scrapers trying to steal your stuff masked as a legit spider. To verify legit search engine crawlers that (should) support conditional GET requests (from Google, Yahoo, MSN and Ask) you can grab my crawler detection routines here. Include them as well, then you can code stuff like that:

$isSpiderUA = checkCrawlerUA (); $isLegitSpider = checkCrawlerIP (__FILE__); if ($isSpiderUA && !$isLegitSpider) { @header("Thou shalt not spoof", TRUE, 403); exit; // make sure your 403-Forbidden ErrorDocument directive in // .htaccess points to a page that explains the issue! } if ($isLegitSpider) { // insert your code dealing with conditional GET requests }

Now that you’re sure that the requestor is a legit crawler from a major search engine, look at the HTTP request header it has submitted to your Web server.

// lookup the HTTP request header for a possible conditional GET $ifModifiedSinceTimestamp = getIfModifiedSince(); // if the request is not conditional, don’t send a 304 $canSend304 = FALSE; if ($ifModifiedSinceTimestamp !== FALSE) { $canSend304 = TRUE; // Tells the requestor that you’ve recognized the conditional GET $echoRequestHeader = "X-Requested-If-modified-since: " .unixTimestamp2HttpDate($ifModifiedSinceTimestamp); @header($echoRequestHeader, TRUE); }

You don’t need to echo the If-Modified-Since HTTP-date in the response header, but this custom header makes testing easier.

Next get the page’s actual last-modified date/time. Here is an (incomplete) code sample for a WordPress single post page.

// select the requested post's comment_count, post_modified and // post_date values, then: if ($wp_post_modified) { $lastModified = date2UnixTimestamp($wp_post_modified); } else { $lastModified = date2UnixTimestamp($wp_post_date); } if (intval($wp_comment_count) > 0) { // select last comment from the WordPress database, then: $lastCommentTimestamp = date2UnixTimestamp($wp_comment_date); if ($lastCommentTimestamp > $lastModified) { $lastModified = $lastCommentTimestamp; } }

The date2UnixTimestamp() function accepts MySQL datetime values as valid input. If you need to (re)write last-modified dates to a MySQL database, convert the Unix timestamps to MySQL datetime values with unixTimestamp2MySqlDatetime().

Your server’s clock isn’t necessarily synchronized with all search engines out there. To cover possible gaps you can use a last-modified timestamp that’s a little bit fresher than the actual last-modified date. In this example the timestamp reported to the crawler is last-modified + 13 hours, you can change the deviant in makeLastModifiedTimestamp().$lastModifiedTimestamp = makeLastModifiedTimestamp($lastModified);

If you compare the timestamps later on, and the request isn’t conditional, don’t run into the 304 routine.if ($ifModifiedSinceTimestamp === FALSE) { // make things equal if the request isn't conditional $ifModifiedSinceTimestamp = $lastModifiedTimestamp; }

You may want to allow a full fetch if the requestor’s timestamp is ancient, in this example older than one month. $tooOld = @strtotime("now") - (31 * 24 * 60 * 60); if ($ifModifiedSinceTimestamp < $tooOld) { $lastModifiedTimestamp = @strtotime("now"); $ifModifiedSinceTimestamp = @strtotime("now") - (1 * 24 * 60 * 60); }
Setting the last-modified attribute to yesterday schedules the next full crawl after this fetch in 30 days (or later, depending on the actual crawl frequency).

Finally respond with 304-Not-Modified if the page wasn’t remarkably changed since the date/time given in the crawler’s If-Modified-Since header. Otherwise send a Last-Modified header with a 200 HTTP response code, allowing the crawler to fetch the page contents. $lastModifiedHeader = "Last-Modified: " .unixTimestamp2HttpDate($lastModifiedTimestamp); if ($lastModifiedTimestamp < $ifModifiedSinceTimestamp && $canSend304) { @header($lastModifiedHeader, TRUE, 304); exit; } else { @header($lastModifiedHeader, TRUE); }

When you’re testing your version of this script with a browser, it will send a standard HTTP request, and your server will return a 200-OK. From your server’s response your browser should recognize the “Last-Modified” header, so when you reload the page the browser should send an “If-Modified-Since” header and you should get the 304 response code if Last-Modified > If-Modified-Since. However, judging from my experience such browser based tests of crawler behavior, respectively responses to crawler requests, aren’t reliable.

Test it with this MS tool instead. I’ve played with it for a while and it works great. With the PHP code above I’ve created a 200/304 test page
http://sebastians-pamphlets.com/tools/last-modified-yesterday.php
that sends a “Last-Modified: Yesterday” response header, and should return a 304-Not Modified HTTP response code when you request it with an “If-Modified-Since: Today+” header, otherwise it should respond with 200-OK (this version returns 200-OK only but tells when it would respond with a 304). You can use this URI with the MS-tool linked above to test HTTP requests with different If-Modified-Since headers.

Have fun and paypal me 50% of your savings.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

12 comments Sebastian | MSN, Web development, Yahoo, Google

Update your crawler detection: MSN/Live Search announces msnbot/1.1

Posted on 12 February, 2008

msnbot/1.1 Fabrice Canel from Live Search announces significant improvements of their crawler today. The very much appreciated changes are:

HTTP compression

The revised msnbot supports gzip and deflate as defined by RFC 2616 (sections 14.11 and 14.39). Microsoft also provides a tool to check your server’s compression / conditional GET support. (Bear in mind that most dynamic pages (blogs, forums, …) will fool such tools, try it with a static page or use your robots.txt.)

No more crawling of unchanged contents

The new msnbot/1.1 will not fetch pages that didn’t change since the last request, as long as the Web server supports the “If-Modified-Since” header in conditional GET requests. If a page didn’t change since the last crawl, the server responds with 304 and the crawler moves on. In this case your Web server exchanges only a handful of short lines of text with the crawler, not the contents of the requested resource.

If your server isn’t configured for HTTP compression and conditional GETs, you really should request that at your hosting service for the sake of your bandwidth bills.

New user agent name

From reading server log files we know the Live Search bot as “msnbot/1.0 (+http://search.msn.com/msnbot.htm)”, or “msnbot-media/1.0″, “msnbot-products/1.0″, and “msnbot-news/1.0″. From now on you’ll see “msnbot/1.1“. Nathan Buggia from Live Search clarifies: “This update does not apply to all the other ‘msnbot-*’ crawlers, just the main msnbot. We will be updating those bots in the future”.

If you just check the user agent string for “msnbot” you’ve nothing to change, otherwise you should check the user agent string for both “msnbot/1.0″ as well as “msnbot/1.1″ before you do the reverse DNS lookup to identify bogus bots. MSN will not change the host name “.search.live.com” used by the crawling engine.

The announcement didn’t tell us whether the new bot will utilize HTTP/1.1 or not (MS and Yahoo crawlers, like other Web robots, still perform, respectively fake, HTTP/1.0 requests).

It looks like it’s no longer necessary to charge Live Search for bandwidth their crawler has burned. Jokes aside, instead of reporting crawler issues to [email protected], you can post your questions or concerns at a forum dedicated to MSN crawler feedback and discussions.

I’m quite nosy, so I just had to investigate what “there are many more improvements” in the blog post meant. I’ve asked Nathan Buggia from Microsoft a few questions.

Nate, thanks for the opportunity to talk crawling with you. Can you please reveal a few msnbot/1.1 secrets?

I’m glad you’re interested in our update, but we’re not yet ready to provide more details about additional improvements. However, there are several more that we’ll be shipping in the next couple months.

Fair enough. So lets talk about related topics.

Currently I can set crawler directives for file types identified by their extensions in my robots.txt’s msnbot section. Will you fully support wildcards (* and $ for all URI components, that is path and query string) in robots.txt in the foreseeable future?

This is one of several additional improvements that we are looking at today, however it has not been released in the current version of MSNBot. In this update we were squarely focused on reducing the burden of MSNBot on your site.

What can or should a Webmaster do when you seem to crawl a site way too fast, or not fast enough? Do you plan to provide a tool to reduce the server load, respectively speed up your crawling for particular sites?

We currently support the “crawl-delay” option in the robots.txt file for webmasters that would like to slow down our crawling. We do not currently support an option to increase crawling frequency, but that is also a feature we are considering.

Will msnbot/1.1 extract URLs from client sided scripts for discovery crawling? If so, will such links pass reputation?

Currently we do not extract URLs from client-side scripts.

Google’s last change of their infrastructure made nofollow’ed links completely worthless, because they no longer used those in their discovery crawling. Did you change your handling of links with a “nofollow” value in the REL attribute with this upgrade too?

No, changes to how we process nofollow links were not part of this update.

Nate, many thanks for your time and your interesting answers!

Related posts:

Official announcement - by Nathan Buggia, Live Search Webmaster Center Blog
MSNbot 1.1: Live Search Implements A More Efficient Crawl - by Vanessa Fox, Search Engine Land

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

7 comments Sebastian | Analytics, MSN, Crawler Directives, Cloaking, robots.txt, SEO

The hacker tool MSN-LiveSearch is responsible for brute force attacks

Posted on 1 February, 2008

401 = Private Property, keep out! A while ago I’ve staged a public SEO contest, asking whether the 401 HTTP response code prevents from search engine indexing or not.

Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their try to fetch a password protected URL bounces with a 401 HTTP response code that translates to a polite “Authorization Required”, meaning “Forbidden unless you provide valid authorization”.

Experience of life and common sense tell search engines, that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters/site owners do not point their users to protected content.

Also, that makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly “Enter user and password in order to access [title of the protected area]”, that resolves to a harsh error message because the searcher can’t provide such information, and usually can’t even sign up from the 401 error page¹.

Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example /members/. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web.

Developing a script that searches for a string like /members/ in URLs and then “tests” the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker’s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DOS-like server attacks.

Ok, lets look at the test results. All search engines have casted their votes now. Here are the winners:

Google

Once my test was out, Matt Cutts from Google researched the question and told me:

My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won’t be indexed even as a reference, so .htacess password-protected directories shouldn’t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn’t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.

Well, that’s exactly the expected behavior, and I wasn’t surprised that my test results confirm Matt’s statement. Thanks to Google’s BlitzIndexing™ Ms. Googlebot spotted the 401 so fast, that the URL never showed up on Google’s SERPs. Google reports the protected URL in my Webmaster Console account for this blog as not indexable.

Yahoo

Yahoo’s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that’s going to change if M$ buys Yahoo.

Ask

Ask’s crawler isn’t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.

And here is the ultimate loser:

MSN LiveSearch

Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker’s toolbox:

MSN LiveSearch indexes password protected URLs

As if indexing references to password protected URLs wouldn’t be crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolute no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The sitemap protocol, that MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:

MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)

All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask have indexed the sitemap file (they never index submitted sitemaps that have no inbound links by the way) or its protected URL.

Summary

All major search engines except MSN respect the 401 barrier.

Since MSN LiveSearch is well known for spamming, it’s not a big surprise that they support hackers, scrapers and other content thieves.

Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can’t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn’t exist in most niches), and so on. Once pointed to such crap, they don’t even fix the simplest bugs in a timely manner. I mean, not complying to the HTTP 1.1 protocol from the last century is an evidence of incapacity, and that’s just one example.

Update Feb/06/2008: Last night I’ve received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, Nathan Buggia! I’m still curious how MSN Live Search will handle sitemap files in the future.

¹ Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there’s no 401 error document. Please note that you shouldn’t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a static 401 error page that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you must not use absolute URLs (http://… or https://…) in the ErrorDocument 401 directive, and that on the error page you must use absolute URLs for CSS, images, links and whatnot because relative URIs don’t work there!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

9 comments Sebastian | Testing, MSN, Search Quality, Crap, Yahoo, .htaccess, Google

My plea to Google - Please sanitize your REP revamps

Posted on 3 January, 2008

Standardization of REP tags as robots.txt directives

Google is confules on REP standards and robots.txt This draft is kinda request for comments for search engine staff and uber search geeks interested in the progress of Robots Exclusion Protocol (REP) standardization (actually, every search engine maintains their own REP standard). It’s based on/extends the robots.txt specifications from 1994 and 1996, as well as additions supported by all major search engines. Furthermore it considers work in progress leaked out from Google.

In the following I’ll try to define a few robots.txt directives that Webmasters really need.

Show Table of Contents

Currently Google experiments with new robots.txt directives, that is REP tags like “noindex” adapted for robots.txt. That’s a welcomed and brilliant move.

Unfortunately, they got it totally wrong, again. (Skip the longish explanation of the rel-nofollow fiasco and my rant on Google’s current robots.txt experiments.)

Google’s last try to enhance the REP by adapting a REP tag’s value in another level was a miserable failure. Not because crawler directives on link-level are a bad thing, the opposite is true, but because the implementation of rel-nofollow confused the hell out of Webmasters, and still does.

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Don’t get me wrong, an instrument to steer search engine crawling and indexing on link level is a great utensil in a Webmaster’s toolbox. Rel-nofollow just lacks granularity, and it was sneakily introduced for the wrong purposes.

Recap: When Google launched rel-nofollow in 2005, they promoted it as a tool to fight comment spam.

From now on, when Google sees the attribute (rel=”nofollow”) on hyperlinks, those links won’t get any credit when we rank websites in our search results. This isn’t a negative vote for the site where the comment was posted; it’s just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists.

Technically spoken, this translates to “search engine crawlers shall/can use rel-nofollow links for discovery crawling, but indexers and ranking algos processing links must not credit link destinations with PageRank, anchor text, nor other link juice originating from rel-nofollow links”. Rel=”nofollow” meant rel=”pass-no-reputation”.

All blog platforms implemented the beast, and it seemed that Google got rid of a major problem (gazillions of irrelevant spam links manipulating their rankings). Not so the bloggers, because the spammers didn’t bother to check whether a blog dofollows inserted links or not. Despite all the condomized links the amount of blog comment spam increased dramatically, since the spammers were forced to attack even more blogs in order to earn the same amount of uncondomized links from blogs that didn’t update to a software version that supported rel-nofollow.

Experiment failed, move on to better solutions like Akismet, captchas or ajax’ed comment forms? Nope, it’s not that easy. Google had a hidden agenda. Fighting blog comment spam was just a snake oil sales pitch, an opportunity to establish rel-nofollow by jumping on a popular band wagon. In 2005 Google had mastered the guestbook spam problem already. Devaluing comment links in well structured pages like blog posts is as easy as doing the same with guestbook links, or identifying affiliate links. In other words, when Google launched rel-nofollow, blog comment spam was definitely not a major search quality issue any more.

Identifying paid links on the other hand is not that easy, because they often appear as editorial links within the content. And that was a major problem for Google, a problem that they weren’t able to solve algorithmically without cooperation of all webmasters, site owners, and publishers. Google actually invented rel-nofollow to get a grip on paid links. Recently they announced that Googlebot no longer follows condomized links (pre-Bigdaddy Google followed condomized links and indexed contents discovered from rel-nofollow links), and their cold war on paid links became hot.

Of course the sneaky morphing of rel-nofollow from “pass no reputation” to a full blown “nofollow” is just a secondary theater of war, but without this side issue (with regard to REP standardization) Google would have lost, hence it was decisive for the outcome of their war on paid links.

To stay fair, Danny Sullivan said twice that rel-nofollow is Dave Winer’s fault, and Google as the victim is not to blame.

Rel-nofollow is settled now. However, I don’t want to see Google using their enormous power to manipulate the REP for selfish goals again. I wrote this rel-nofollow recap because probably, or possibly, Google is just doing it once more:

Google’s “Noindex: in robots.txt” experiment

Google supports a Noindex: directive in robots.txt. It seems Google’s Noindex: blocks crawling like Disallow:, but additionally prevents URLs blocked with Noindex: both from accumulating PageRank as well as from indexing based on 3rd party signals like inbound links.

This functionality would be nice to have, but accomplishing it with “Noindex” is badly wrong. The REP’s “Noindex” value without an explicit “Nofollow” means “crawl it, follow its links, but don’t list it on SERPs”. With pagel-level directives (robots meta tags and X-Robots-Tags) Google handles “Noindex” exactly as defined, that means with an implicit “Follow”. Not so in robots.txt. Mixing crawler directives (Disallow:) with indexer directives (Noindex:) this way takes the “Follow” out of the game, because a search engine can’t follow links from uncrawled documents.

Webmasters will not understand that “Nofollow” means totally different things in robots.txt and meta tags. Also, this approach steals granularity that we need, for example for use with technically structured sitemap pages and other hubs.

According to Google their current interpretation of Noindex: in robots.txt is not yet set in stone. That means there’s an opportunity for improvement. I hope that Google, and other search engines as well, listen to the needs of Webmasters.

Dear Googlers, don’t take the above said as Google bashing. I know, and often wrote, that Google is the search engine that puts the most efforts in boring tasks like REP evolvement. I just think that a dog company like Google needs to take real-world Webmasters into the boat when playing with standards like the REP, for the sake of the cats.

Recap: Existing robots.txt directives

The /path example in the following sections refers to any way to assign URIs to REP directives, not only complete URIs relative to the server’s root. Patterns can be useful to set crawler directives for a bunch of URIs:

*: any string in path or query string, including the query string delimiter “?”, multiple wildcards should be allowed.
$: end of URI
Trailing /: (not exactly a pattern) addresses a directory, its files and subdirectories, the subdirectorie’s files etc., for example
- Disallow: /path/
  matches /path/index.html but not /path.html
- /path
  matches both /path/index.html and /path.html, as well as /path_1.html. It’s a pretty common mistake to “forget” the trailing slash in crawler directives meant to disallow particular directories. Such mistakes can result in blocking script/page-URIs that should get crawled and indexed.

Please note that patterns aren’t supported by all search engines, for example MSN supports only file extensions (yet?).

User-agent: [crawler name]
Groups a set of instructions for a particular crawler. Crawlers that find their own section in robots.txt ignore the User-agent: * section that addresses all Web robots. Each User-agent: section must be terminated with at least one empty line.

Disallow: /path
Prevents from crawling, but allows indexing based on 3rd party information like anchor text and surrounding text of inbound links. Disallow’ed URLs can gather PageRank.

Allow: /path
Refines previous Disallow: statements. For example Disallow: /scripts/ Allow: /scripts/page.php
tells crawlers that they may fetch http://example.com/scripts/page.php or http://example.com/scripts/page.php?article=1, but not any other URL in http://example.com/scripts/.

Sitemap: [absolute URL]
Announces XML sitemaps to search engines. Example: Sitemap: http://example.com/sitemap.xml Sitemap: http://example.com/video-sitemap.xml
points all search engines that support Google’s Sitemaps Protocol to the sitemap locations. Please note that sitemap autodiscovery via robots.txt doesn’t replace sitemap submissions. Google, Yahoo and MSN provide Webmaster Consoles where you not only can submit your sitemaps, but follow the indexing process (wishful thinking WRT particular SEs). In some cases it might be a bright idea to avoid the default file name “sitemap.xml” and keep the sitemap URLs out of robots.txt, sitemap autodiscovery is not for everyone.

Recap: Existing REP tags

REP tags are values that you can use in a page’s robots meta tag and X-Robots-Tag. Robots meta tags go to the HTML document’s HEAD section <meta name="robots" content="noindex, follow, noarchive" />
whereas X-Robots-Tags supply the same information in the HTTP header X-Robots-Tag: noindex, follow, noarchive
and thus can instruct crawlers how to handle non-HTML resources like PDFs, images, videos, and whatnot.

Widely supported REP tags are:

INDEX|NOINDEX - Tells whether the page may be indexed (listed on SERPs) or not
FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided in the document or not
ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
NOODP - tells search engines not to use page titles and descriptions pulled from DMOZ on their SERPs.
NOYDIR - tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs.
NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes an URL from Google’s search index a day after the given date/time

Problems with REP tags in robots.txt

REP tags (index, noindex, follow, nofollow, all, none, noarchive, nosnippet, noodp, noydir, unavailable_after) were designed as page-level directives. Setting those values for groups of URLs makes steering search engine crawling and indexing a breeze, but also comes with more complexity and a few pitfalls as well.

Page-level directives are instructions for indexers and query engines, not crawlers. A search engine can’t obey REP tags without crawling the resource that supplies them. That means that not a single REP tag put as robots.txt statement shall be misunderstood as crawler directive.

For example Noindex: /path must not block crawling, not even in combination with Nofollow: /path, because there’s still the implicit “archive” (= absence of Noarchive: /path). Providing a cached copy even of a not indexed page makes sense for toolbar users.

Whether or not a search engine actually crawls a resource that’s tagged with “noindex, nofollow, noarchive, nosnippet” or so is up to the particular SE, but none of those values implies a Disallow: /path.
Historically, a crawler instruction on HTML element level overrules the robots meta tag. For example when the meta tag says “follow” for all links on a page, the crawler will not follow a link that is condomized with rel=”nofollow”.

Does that mean that a robots meta tag overrules a conflicting robots.txt statement? Of course not in any case. Robots.txt is the gatekeeper, and so to say the “highest REP instance”. Actually, to this question there’s no absolute answer that satisfies everybody.

A Webmaster sitting on a huge conglomerate of legacy code may want to totally switch to robots.txt directives, that means search engines shall ignore all the BS in ancient meta tags of pages created in the stone age of the Internet. Back then the rules were different. An alternative/secondary landing page’s “index,follow” from 1998 most probably doesn’t fly with 2008’s duplicate content filters and high sophisticated link pattern analytics.

The Webmaster of a well designed brand new site on the other hand might be happy with a default behavior where page-level REP tags overrule site-wide directives in robots.txt.
REP tags used in robots.txt might refine crawler directives. For example a disallow’ed URL can accumulate PageRank, and may be listed on SERPs. We need at least two different directives ruling PageRank caluculation and indexing for uncrawlable resources (see below under Noodp:/Noydir:, Noindex: and Norank:).

Google’s current approach to handle this with the Noindex: directive alone is not acceptable, we need a new REP tag to handle this case. Next up, when we introduce a new REP tag for use in robots.txt, we should allow it in meta tags and HTTP headers too.
In theory it makes no sense to maintain a directive that describes a default behavior. But why has the REP “follow” although the absence of “nofollow” perfectly expresses “follow”? Because of the way non-geeks think (try to explain why the value nil/null doesn’t equal empty/zero/blank to a non-geek. Not!).

Implicit directives that aren’t explicitely named and described in the rules don’t exist for the masses. Even in the 10 commandments someone had to write “thou shalt not hotlink|scrape|spam|cloak|crosslink|hijack…” instead of a no-brainer like “publish unique and compelling content for people and make your stuff crawlable”. Unfortunately, that works the other way round too. If a statement (Index: or Follow:) is dependent on another one (Allow: respectively the absence of Disallow:) folks will whine, rant and argue when search engines ignore their stuff.

Obviously we need at least Index:, Follow: and Archive to keep the standard usable and somewhat understandable. Of course crawler directives might thwart such indexer directives. Ignorant folks will write alphabetically ordered robots.txt files like Disallow: /cgi-bin/ Disallow: /content/ ... Follow: /cgi-bin/redirect.php Follow: /content/links/ ... Index: /content/articles/
without Allow: /content/links/, Allow: /content/articles/ and Allow: /cgi-bin/redirect.

Whether or not indexer directives that require crawling can overrule the crawler directive Disallow: is open for discussion. I vote for “not”.
Applying REP tags on site-level would be great, but it doesn’t solve other problems like the need of directives on block and element level. Both Google’s section targeting as well as Yahoo’s robots-nocontent class name aren’t acceptable tools capable to instruct search engines how to handle content in particular page areas (advertising blocks, navigation and other templated stuff, links in footers or sidebar elements, and so on).

Instead of editing bazillions of pages, templates, include files and whatnot to insert rel-nofollow/nocontent stuff for the sole purpose of sucking up to search engines, we need an elegant way to apply such micro-directives via robots.txt, or at least site-wide sets of instructions referenced in robots.txt. Once that’s doable, Webmasters will make use of such tools to improve their rankings, and not alone to comply to the ever changing search engine policies that cost the Webmaster community billions of man hours each year.

I consider these robots.txt statements sexy: Nofollow a.advertising, div#adblock, span.cross-links: /path Noindex .inherited-properties, p#tos, p#privacy, p#legal: /path
but that’s a wish list for another post. However, while designing site-wide REP statements we should at least think of block/element level directives.

Remember the rel-nofollow fiasco where a REP tag was used on HTML element level producing so much confusion and conflicts. Lets learn from past mistakes and make it perfect this time. A perfect standard can be complex, but it’s clear and unambiguous.

Priority settings

The REP’s command hierarchy must be well defined:

robots.txt
Page meta tags and X-Robots-Tags in the HTTP header. X-Robots-Tag values overrule conflicting meta tag values.
[Future block level directives]
Element level directives like rel-nofollow

That means, when crawling is allowed, page level instructions overrule robots.txt, and element level (or future block level) directives overrule page level instructions as well as robots.txt. As long as the Webmaster doesn’t revert the latter:

Priority-page-level: /path
Default behavior, directives in robots meta tags overrule robots.txt statements. Necessary to reset previous Priority-site-level: statements.

Priority-site-level: /path
Robots.txt directives overrule conflicting directives in robots meta tags and X-Robots-Tags.

Priority-site-level All: /path
Robots.txt directives overrule all directives in robots meta tags or provided elsewhere, because those are completely ignored for all URIs under /path. The “All” parameter would even dofollow nofollow’ed links when the robots.txt lacks corresponding Nofollow: statements.

Noindex: /path

Follow outgoing links, archive the page, but don’t list it on SERPs. The URLs can accumulate PageRank etcetera. Deindex previously indexed URLs.

[Currently Google doesn’t crawl Noindex’ed URLs and most probably those can’t accumulate PageRank, hence URLs in /path can’t distribute PageRank. That’s plain wrong. Those URLs should be able to pass PageRank to outgoing links when there’s no explicit Nofollow:, nor a “nofollow” meta tag respectively X-Robots-Tag.]

Norank: /path

Prevents URLs from accumulating PageRank, anchor text, and whatever link juice.

Makes sense to refine Disallow: statements in company with Noindex: and Noodp:/Noydir:, or to prevent TOS/contact/privacy/… pages and alike from sucking PageRank (nofollow’ing TOS links and stuff like that to control PageRank flow is fault-prone).

Nofollow: /path

The uber-link-condom. Don’t use outgoing links, not even internal links, for discovery crawling. Don’t credit the link destinations with any reputation (PageRank, anchor text, and whatnot).

Noarchive: /path

Don’t make a cached copy of the resource available to searchers.

Nosnippet: /path

List the resource with linked page title on SERPs, but don’t create a text snippet, and don’t reprint the description meta tag.

[Why don’t we have a REP tag saying “use my description meta tag or nothing”?]

Nopreview: /path

Don’t create/link an HTML preview of this resource. That’s interesting for subscriptions sites and applies mostly to PDFs, Word documents, spread sheets, presentations, and other non-HTML resources. More information here.

Noodp: /path

Don’t use the DMOZ title nor the DMOZ description for this URL on SERPs, not even when this resource is a non-HTML document that doesn’t supply its own title/meta description.

Noydir: /path

I’m not sure this one makes sense in robots.txt, because only Yahoo search uses titles and descriptions from the Yahoo directory. Anyway: “Don’t overwrite the page title listed on the SERPs with information pulled from the Yahoo directory, although I paid for it.”

Unavailable_after [date]: /path

Deindex the resource the day after [date]. The parameter [date] is put in any date or date/time format, if it lacks a timezone then GMT is assumed.

[Google’s RFC 850 obsession is somewhat weird. There are many ways to put a timestamp other than “25-Aug-2007 15:00:00 EST”.]

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

In the search index remove the unwanted variable/value pair(s) from the URL’s query string and transfer PageRank and other link juice to the matching URL without those parameters. If this “bare URL” redirects, or is uncrawlable for other reasons, index it with the content pulled from the page with the more complex URL.

Regardless whether the variable name or the variable’s value matches the pattern, “Truncate_*” statements remove a complete argument from the query string, that is &variable=value. If after the (last) truncate operation the query string is empty, the querystring delimiter “?” (questionmark) must be removed too.

Order-arguments [charset]: /path

Sort the query strings of all dynamic URLs by variable name, then within the ordered variables by their values. Pick the first URL from each set of identical results as canonical URL. Transfer PageRank etcetera from all dupes to the canonical URL.

Lots of sites out there were developed by coders who are utterly challenged by all things SEO. Most Web developers don’t even know what URL canonicalization means. Those sites suffer from tons of URLs that all serve identical contents, just because the query string arguments are put in random order, usually inventing a new sequence for each script, function, or include file. Of course most search engines run high sophisticated URL canonicalization routines to prevent their indexes from too much duplicate content, but those algos can fail because every Web site is different.

I totally can resist to suggest a Canonical-uri /: /Default.asp statement that gathers all IIS default-document-URI maladies. Also, case issues shouldn’t get fixed with Case-insensitive-uris: / but by the clueless developers in Redmond.

Will all this come true?

Well, Google has silently started to support REP tags in robots.txt, it totally makes sense both for search engines as well as for Webmasters, and Joe Webmaster’s life would be way more comfortable having REP tags for robots.txt.

A better question would be “will search engines implement REP tags for robots.txt in a way that Webmasters can live with it?”. Although Google launched the sitemaps protocol without significant help from the Webmaster community, I strongly feel that they desperately need our support with this move.

Currently it looks like they will fuck up the REP, respectively the robots.txt standard, hence go grab your AdWords rep and choke her/him until s/he promises to involve Larry, Sergey, Matt, Adam, John, and the whole Webmaster Support Team for the sake of common sense and the worldwide Webmaster community. Thank you!

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

21 comments Sebastian | Robots Meta Tags, Web development, X-Robots-Tag, MSN, Crawler Directives, XML-Sitemaps, Microformats, Google, Yahoo, robots.txt, Nofollow

1 | 2 Next Page »

Archived posts from the 'MSN' Category

WTF - The official SEO standard, supported by Google, Yahoo & Bing

Subset WTFm — Word Targeting Funnel for medical terms

Subset WTFa — Word Targeting Funnel for acronyms

Subset WTFo — Word Targeting Funnel for offensive language

WTF, even more?

The anatomy of a smart robots.txt

Install

Customize

Test

Launch

The output of a smart robots.txt

Search engines hide important information from webmasters

All search engines obey the X-Canonical-URI HTTP header

How should a search engine handle a shortened URI?

Bing doesn’t get redirects at all

Yahoo does an awesome job, with a tiny exception

Google does it by the book, but there’s room for improvements

About abusive short URIs

Fight shortened URIs!

Static files

Dynamic contents

Handling Last-Modified, If-Modified-Since, and Not-Modified HTTP headers with PHP/Apache

Google

Yahoo

Ask

MSN LiveSearch

Summary

Standardization of REP tags as robots.txt directives

Rel-Nofollow or how Google abused standardization of Web robots directives for selfish purposes

Google’s “Noindex: in robots.txt” experiment

Recap: Existing robots.txt directives

Recap: Existing REP tags

Problems with REP tags in robots.txt

Priority settings

Noindex: /path

Norank: /path

Nofollow: /path

Noarchive: /path

Nosnippet: /path

Nopreview: /path

Noodp: /path

Noydir: /path

Unavailable_after [date]: /path

Truncate-variable [string|pattern]: /path

Truncate-value [string|pattern]: /path

Order-arguments [charset]: /path

Will all this come true?

Categories

Monthly Archives

Links

RSS Feeds