A while ago I staged a public SEO contest, asking whether or not the 401 HTTP response code prevents search engine indexing.
Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempt to fetch a password protected URL bounces with a 401 HTTP response code that translates to a polite “Authorization Required”, meaning “Forbidden unless you provide valid authorization”.
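To make the expected crawler behavior concrete, here is a minimal Python sketch of the decision described above. The function names `fetch_status` and `may_index` are made up for illustration, and the rule set is an assumption about how a well-behaved crawler might act, not how any real search engine is implemented.

```python
from http import HTTPStatus
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url: str) -> int:
    """Fetch a URL without credentials and return the HTTP status code."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still useful.
        return error.code


def may_index(status_code: int) -> bool:
    """Hypothetical rule: index only content that was served publicly."""
    # 401 (Apache's "Authorization Required") and 403 both say "not public".
    if status_code in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        return False
    # Anything outside the 2xx range is not a successfully served document.
    return 200 <= status_code < 300
```

A crawler following this rule would call `may_index(fetch_status(url))` and drop the URL, including any bare reference to it, as soon as it sees the 401.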
Life experience and common sense tell search engines that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters/site owners do not point their users to protected content.
Also, listing such URLs makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly “Enter user and password in order to access [title of the protected area]”, which resolves to a harsh error message because the searcher can’t provide such information, and usually can’t even sign up from the 401 error page¹.
Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example /members/. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web.
Developing a script that searches for a string like /members/ in URLs and then “tests” the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker’s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DoS-like server attacks.
Ok, let’s look at the test results. All search engines have cast their votes now. Here are the winners:
My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won’t be indexed even as a reference, so .htaccess password-protected directories shouldn’t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn’t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.
Well, that’s exactly the expected behavior, and I wasn’t surprised that my test results confirm Matt’s statement. Thanks to Google’s BlitzIndexing™, Ms. Googlebot spotted the 401 so fast that the URL never showed up on Google’s SERPs. Google reports the protected URL in my Webmaster Console account for this blog as not indexable.
Yahoo’s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that’s going to change if M$ buys Yahoo.
Ask’s crawler isn’t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.
And here is the ultimate loser:
Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker’s toolbox:
As if indexing references to password protected URLs weren’t crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolutely no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The sitemap protocol, which MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:
All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask has indexed the sitemap file (by the way, they never index submitted sitemaps that have no inbound links) or its protected URL.
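For readers who haven’t seen how a sitemap gets “referenced in robots.txt only”, here is a small Python sketch that pulls the Sitemap lines out of a robots.txt file. The function name `sitemap_urls` and the example.com URLs are placeholders, and the parser is deliberately simplified; it illustrates the submission channel and is not the code any search engine runs.

```python
from typing import List


def sitemap_urls(robots_txt: str) -> List[str]:
    """Return the URLs announced via 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Simplification: everything after '#' is treated as a comment.
        line = raw_line.split("#", 1)[0].strip()
        # Split at the first colon only, so the URL's own colons survive.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt of a site with a protected /members/ area.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://example.com/sitemap.xml
"""

print(sitemap_urls(EXAMPLE_ROBOTS_TXT))
```

The point is that this line is addressed to crawlers. Nothing in it asks a search engine to show the sitemap file itself to searchers.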
All major search engines except MSN respect the 401 barrier.
Since MSN LiveSearch is well known for spamming, it’s not a big surprise that they support hackers, scrapers and other content thieves.
Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can’t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn’t exist in most niches), and so on. Once pointed to such crap, they don’t even fix the simplest bugs in a timely manner. I mean, not complying with the HTTP 1.1 protocol from the last century is evidence of incapacity, and that’s just one example.
Update Feb/06/2008: Last night I received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, Nathan Buggia! I’m still curious how MSN Live Search will handle sitemap files in the future.
1 Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there’s no 401 error document. Please note that you shouldn’t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a static 401 error page that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you must not use absolute URLs (http://… or https://…) in the ErrorDocument 401 directive, and that on the error page you must use absolute URLs for CSS, images, links and whatnot because relative URIs don’t work there!
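The absolute URL pitfall in the ErrorDocument 401 directive is easy to check for mechanically. Below is a rough Python sketch of such a check; the function name `absolute_401_targets` and the sample .htaccess content are invented for illustration, and the parsing is simplified (it ignores line continuations and container sections).

```python
from typing import List


def absolute_401_targets(htaccess_text: str) -> List[int]:
    """Return line numbers where ErrorDocument 401 points to an absolute URL."""
    offending_lines = []
    for number, raw_line in enumerate(htaccess_text.splitlines(), start=1):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # directive, status code, target
        if len(parts) < 3:
            continue
        directive, code, target = parts
        # Apache directive names are case-insensitive.
        if directive.lower() != "errordocument" or code != "401":
            continue
        if target.strip('"').lower().startswith(("http://", "https://")):
            offending_lines.append(number)
    return offending_lines


# Hypothetical .htaccess: line 2 is fine, line 3 has the problem.
EXAMPLE_HTACCESS = """\
# error pages
ErrorDocument 401 /errors/401.html
ErrorDocument 401 http://example.com/errors/401.html
"""

print(absolute_401_targets(EXAMPLE_HTACCESS))
```

With an absolute URL Apache answers with a redirect instead of the 401, so the browser never shows the user/password prompt; a local path like /errors/401.html avoids that.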