Fabrice Canel from Live Search announces significant improvements of their crawler today. The very much appreciated changes are:
- HTTP compression
The revised msnbot supports gzip and deflate as defined by RFC 2616 (sections 14.11 and 14.39). Microsoft also provides a tool to check your server’s compression / conditional GET support. (Bear in mind that most dynamic pages (blogs, forums, …) will fool such tools, try it with a static page or use your robots.txt.)
- No more crawling of unchanged contents
The new msnbot/1.1 will not fetch pages that didn’t change since the last request, as long as the Web server supports the “If-Modified-Since” header in conditional GET requests. If a page didn’t change since the last crawl, the server responds with 304 and the crawler moves on. In this case your Web server exchanges only a handful of short lines of text with the crawler, not the contents of the requested resource.
If your server isn’t configured for HTTP compression and conditional GETs, you really should request that at your hosting service for the sake of your bandwidth bills.
- New user agent name
From reading server log files we know the Live Search bot as “msnbot/1.0 (+http://search.msn.com/msnbot.htm)”, or “msnbot-media/1.0″, “msnbot-products/1.0″, and “msnbot-news/1.0″. From now on you’ll see “msnbot/1.1“. Nathan Buggia from Live Search clarifies: “This update does not apply to all the other ‘msnbot-*’ crawlers, just the main msnbot. We will be updating those bots in the future”.
If you just check the user agent string for “msnbot” you’ve nothing to change, otherwise you should check the user agent string for both “msnbot/1.0″ as well as “msnbot/1.1″ before you do the reverse DNS lookup to identify bogus bots. MSN will not change the host name “.search.live.com” used by the crawling engine.
The announcement didn’t tell us whether the new bot will utilize HTTP/1.1 or not (MS and Yahoo crawlers, like other Web robots, still perform, respectively fake, HTTP/1.0 requests).
It looks like it’s no longer necessary to charge Live Search for bandwidth their crawler has burned. Jokes aside, instead of reporting crawler issues to firstname.lastname@example.org, you can post your questions or concerns at a forum dedicated to MSN crawler feedback and discussions.
I’m quite nosy, so I just had to investigate what “there are many more improvements” in the blog post meant. I’ve asked Nathan Buggia from Microsoft a few questions.
Nate, thanks for the opportunity to talk crawling with you. Can you please reveal a few msnbot/1.1 secrets?
I’m glad you’re interested in our update, but we’re not yet ready to provide more details about additional improvements. However, there are several more that we’ll be shipping in the next couple months.
Fair enough. So lets talk about related topics.
Currently I can set crawler directives for file types identified by their extensions in my robots.txt’s msnbot section. Will you fully support wildcards (* and $ for all URI components, that is path and query string) in robots.txt in the foreseeable future?
This is one of several additional improvements that we are looking at today, however it has not been released in the current version of MSNBot. In this update we were squarely focused on reducing the burden of MSNBot on your site.
What can or should a Webmaster do when you seem to crawl a site way too fast, or not fast enough? Do you plan to provide a tool to reduce the server load, respectively speed up your crawling for particular sites?
We currently support the “crawl-delay” option in the robots.txt file for webmasters that would like to slow down our crawling. We do not currently support an option to increase crawling frequency, but that is also a feature we are considering.
Will msnbot/1.1 extract URLs from client sided scripts for discovery crawling? If so, will such links pass reputation?
Currently we do not extract URLs from client-side scripts.
Google’s last change of their infrastructure made nofollow’ed links completely worthless, because they no longer used those in their discovery crawling. Did you change your handling of links with a “nofollow” value in the REL attribute with this upgrade too?
No, changes to how we process nofollow links were not part of this update.
Nate, many thanks for your time and your interesting answers!
- Related posts:
- Official announcement - by Nathan Buggia, Live Search Webmaster Center Blog
- MSNbot 1.1: Live Search Implements A More Efficient Crawl - by Vanessa Fox, Search Engine Land
Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to Entries Comments All Comments