Dump your self-banning CMS

When it comes to cluelessness [silliness, idiocy, stupidity … you name it], you can't beat CMS developers. You really can't. There's literally no way to kill search engine traffic that the average content management system (CMS) developer doesn't implement. Poor publishers: chances are you suffer from the top 10 issues on my shitlist. Sigh.

Imagine you're the proud owner of a Web site that lets logged-in users customize the look & feel and whatnot. Here's how your CMS does the trick:

Unusable user interface

The user control panel offers a gazillion settings that can overwrite each and every CSS property out there. To keep the user-cp pages lean and fast loading, the properties are spread over 550 pages with 10 attributes each, all with very comfortable Previous|Next-Page navigation. Even when the user has chosen a predefined template, the CMS saves each property in the user table. Of course that's necessary, because the site admin could change a template in use.

Amateurish database design

Not only for this purpose, each user tuple comes with 512 mandatory attributes. Unfortunately, the underlying database can't handle tables with more than 512 columns, so the overflow gets stored as an array in the large text column #512.

Cookie hell

Since every database access is expensive, the login procedure creates a persistent cookie (today + 365 * 30) for each user property. Dynamic and user-specific external CSS files, as well as style sheets served in the HEAD section, could fail to apply, so all CMS scripts use a routine that converts the user settings into inline style directives like style="color:red; font-weight:bolder; text-decoration:none; ...". The developer consults the W3C CSS guidelines to make sure that not a single CSS property is left out.

Excessive query strings

Actually, not all user agents handle cookies properly. Especially cached pages clicked from SERPs load with a rather weird design. The same goes for standards-compliant browsers. It seems to depend on the user agent string, so the developer adds an if ($well_behaving_user_agent_string <> $HTTP_USER_AGENT) then [read the user record and add each property as a GET variable to the URI's query string] sanity check. Of course the $well_behaving_user_agent_string variable gets populated with a constant containing the developer's ancient IE user agent, and the GET inputs overwrite the values gathered from cookies.

Even more sanitizing

Some unhappy campers still claim that the CMS ignores some user properties, so the developer adds a routine that reads the user table and populates all variables that were previously filled from GET inputs overwriting cookie inputs. All clients are happy now.

Covering robots

“Cached copy” links from SERPs still produce weird pages. The developer stumbles upon my blog and adds crawler detection. S/he creates a tuple for each known search engine crawler in the user table of her/his local database and codes if ($isSpider) then [select * from user where user.usrName = $spiderName, populating the current script's CSS property variables from the requesting crawler's user settings]. Testing the rendering with a user agent faker gives fine results: bug fixed. To make sure that all user agents get a nice page, the developer sets the output default to “printer”, which produces a printable page ignoring all user settings that assign style="display:none;" to superfluous HTML elements.

Results

Users are happy, they don't spot the code bloat. But search engine crawlers do. They sneakily request a few pages as a crawler, and a few as a browser. Comparing the results, they find the "poor" pages delivered to the feigned browser way too different from the "rich" pages serving as crawler fodder. The domain gets banned for poor-man's-cloaking (as if cloaking in general could be a bad thing, but that's a completely different story). The publisher spots decreasing search engine traffic and wonders why. No help available from the CMS vendor. Must be unintentionally deceptive SEO copywriting or so. Crap. That's self-banning by software design.

Ok, before you read on: get a calming tune.

How can I detect a shitty CMS?

Well, you can't, at least not as a non-geeky publisher. Not really. Of course you can check the "cached copy" links from your SERPs all night long. If they show way too different results compared to your browser's rendering, you're at risk. You can look at your browser's address bar to check your URIs for overlong query strings, and if you can't find the end of the URI, perhaps you're toast, search-engine-wise. You can download tools to check a page's cookies, and if there are more than 50, you're potentially search-engine-dead. Probably you can't do a code review yourself coz you can't read source code natively, and your CMS vendor has delivered spaghetti code. Also, as a publisher, you can't tell whether your crappy rankings depend on shitty code or on your skills as a copywriter. When you ask your CMS vendor, usually the search engine algo is faulty (especially Google, Yahoo, Microsoft and Ask), but some exotic search engine from Togo or so sets the standards for state-of-the-art search engine technology.

Last but not least, as a non-search-geek challenged by Web development techniques you won't recognize most of the laughable (but very common) mistakes outlined above. Actually, most savvy developers will not be able to create a complete shitlist from my scenario. Also, there are tons of other common CMS issues that result in different crawlability problems, each as bad as this one, or even worse.

Now what can you do? Well, my best advice is: don't click on Google ads titled "CMS", and don't look at prices. The cheapest CMS will cost you the most at the end of the day. And if your budget exceeds a grand or two, then please hire an experienced search engine optimizer (SEO) or search-savvy Web developer before you implement a CMS.




Professional Twitter-Stalking

Today Kelvin Newman asked me for a Twitter-tip. Well, I won’t reveal what he’s gathered so far until he publishes his collection, but I thought I could post a TwitterTip myself. I’m on a dead slow Internet connection, so here’s the KISS-guide to professional stalking on Twitter:

Collect RSS-Feed URIs

Every Twitter-user maintains an RSS feed, regardless of whether you can spot the RSS icon on her/his profile page or not. If there's no public link to the feed, then click "view source", scroll down to the RSS link element type="application/rss+xml" in the HEAD section, and scrape the URI from the HREF attribute. It should look like http://twitter.com/statuses/user_timeline/3220501.rss (that's mine).

Merge the Feeds

Actually, I hate this service coz they apply nofollow-toxin to my links, but it's quite easy to use and reliable (awfully slow in design mode, though). So, (ouch) go to Yahoo Pipes, sign in with any Yahoo-ID you've not yet burned with spammy activities, and click on "Create New Pipe".

Grab a “Fetch Feed” element and insert your collected RSS-URIs. You can have multiple feed-suckers in a pipe, for example one per stalked Twitter user, or organize your idols in groups. In addition to the Twitter-feed you could add her/his blog-feed, and last.fm or you-porn stuff as well to get the big picture.

Create a “Union” element from the “operator” menu and connect all your feed-suckers to the merger.

Next create a “Sort” element and connect it to the merger. Sort by date of publication in descending order to get the latest tweets at the top. Bear in mind that feeds aren’t real time. When you subscribe later on, you’ll miss out on the latest tweets, but your feed reader will show you even deleted updates.

Finally connect the sorter to the outputter and save the whole thingy. Click on “Run Pipe” or the debugger to preview the results.

Here’s how such a stalker tool gets visualized in Yahoo Pipes:

Pipe: Twitter-Stalker-Feed
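By the way, if Pipes is down again or you'd rather script the whole thing, the merge-and-sort part fits in a few lines of PHP. Here's a minimal sketch, assuming plain RSS 2.0 feeds; the second feed URI is just a placeholder, use the ones you've collected in the first step:

// DIY pipe: pull a few Twitter RSS feeds, merge the items and sort them
// by date, newest first
$feedUris = array(
'http://twitter.com/statuses/user_timeline/3220501.rss',
'http://twitter.com/statuses/user_timeline/1234567.rss' // placeholder
);
$items = array();
foreach ($feedUris as $feedUri) {
$rss = @simplexml_load_file($feedUri);
if ($rss === FALSE) continue; // skip feeds that are down
foreach ($rss->channel->item as $item) {
$items[] = array(
'time'  => @strtotime((string) $item->pubDate),
'title' => (string) $item->title,
'link'  => (string) $item->link
);
}
}
// latest tweets first
function byDateDesc($a, $b) { return $b['time'] - $a['time']; }
usort($items, 'byDateDesc');
foreach ($items as $item) {
echo date('Y-m-d H:i', $item['time']) . '  ' . $item['title'] . "\n    " . $item['link'] . "\n";
}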

Subscribe and Enjoy

On the “Run Feed” page Yahoo shows the pipe’s RSS-URI, e.g. http://pipes.yahoo.com/pipes/pipe.info?_id=_rEQorAu3hGQVK9z3nBDOQ. You can prettify this rather ugly address if you prefer talking URIs.

Copy the pipe’s URI and subscribe with your preferred RSS reader. Enjoy.

Thou shalt not stalk me!




About the bad taste of shameless ego food

Seems I've made it onto the short-list at the SEMMY 2009 Awards in the Search Tech category. Great ego food. I'm honored. Thanks for nominating me that often! And thanks to John Andrews and Todd Mintz for the kind judgement!

Now that you’ve read the longish introduction, why not click here and vote for my pamphlet?

Ok Ok Ok, it's somewhat technical, and you perhaps even consider it plain geek food. However, it's hopefully useful for your daily work nevertheless. BTW … I wish more search engine engineers would read it. ;) It could help them to tidy up their flawed REP support.

Does this post smell way too selfish? I guess it does, but I’ll post it nonetheless coz I’m fucking keen on your votes. ;) Thanks in advance!

  Wow, I won! Thank you all!




Crawling vs. Indexing

Sigh. I just have to throw in my 2 cents.

Crawling means sucking content without processing the results. Crawlers are rather dumb processes that fetch content supplied by Web servers in response to (HTTP) requests for particular URIs, delivering those contents to other processes, e.g. crawling caches, or directly to indexers. Crawlers get their URIs from a crawling engine that's fed from different sources, including links extracted from previously crawled Web documents, URI submissions, foreign Web indexes, and whatnot.

Indexing means making sense out of the retrieved contents, storing the processing results in a (more or less complex) document index. Link analysis is a way to measure URI importance, popularity, trustworthiness and so on. Link analysis is often just a helper within the indexing process, sometimes an end in itself, but traditionally a task of the indexer, not the crawler (highly sophisticated crawling engines do use link data to steer their crawlers, but that has nothing to do with link analysis in document indexes).

A crawler directive like “disallow” in robots.txt can direct crawlers, but means nothing to indexers.

An indexer directive like “noindex” in an HTTP header, an HTML document’s HEAD section, or even a robots.txt file, can direct indexers, but means nothing to crawlers, because the crawlers have to fetch the document in order to enable the indexer to obey those (inline) directives.

So when a Web service offers an indexer directive like <meta name="SEOservice" content="noindex" /> to keep particular content out of its index, but doesn’t offer a crawler directive like User-agent: SEOservice Disallow: /, this Web service doesn’t crawl.

That’s not about semantics, that’s about Web standards.

Whether or not such a Web service can come up with incredible value squeezed out of an index gathered elsewhere, without crawling the Web itself, is a completely different story.




Still not yet speechless, just swamped

Long time no blogging … sorry folks. I'm swamped by a huge project that has nothing to do with SEO, and not much with webmastering at all. I'm dealing with complex backend systems and all my script outputs go to a closed user group, so I can't even blog a new SEO finding or insight every now and then. Ok, except experiences like "Google Maps Premier: 'organizations need more' … well … contact with a salesman within days, not months or years … and of course prices on the Web site". ;)

However, it's an awesome experience to optimize business processes that are considered extremely painful in most companies out there. Time recording, payroll accounting, reimbursement of travel expenses, project controlling, and invoicing of time and material in complex service projects are a nightmare that requires handling shitloads of paper, importing timesheets from spreadsheets, emails and whatnot … usually. No longer. Compiling data from cellphones, PDAs, blackberries, iPhones, HTML forms, somewhat intelligent time clocks and so on in near real time is a smarter way to build the data pool necessary for accounting and invoicing, and allows fully automated creation of travel expense reports, payslips, project reports and invoices with a few mouse clicks in your browser. If you're interested, drop me a line and I'll link you to the startup company I'm working for.

Oh well, I’ve got a long list of topics I wanted to blog, but there’s no time left because I consider my cute monsters more important than blogging and such stuff. For example, I was going to write a pamphlet about Technorati’s spam algos (do not ping too many of your worst enemy’s URLs too often because that’ll ban her/his blog), Google’s misunderstanding of the Robots Exclusion Protocol (REP) (crawler directives like “disallow” in robots.txt do not forbid search engine indexing - the opposite is true), or smart ways to deal with unindexable URIs that contain .exe files when you’re using tools like Progress WebSpeed on Windows boxes with their default settings (hint: Apache’s script alias ends your pain). Unfortunately, none of these posts will be written (soon). Anywayz, I’ll try to update you more often, but I can’t promise anything like that in the near future. Please don’t unsubscribe, I’ll come back to SEO topics. As for the comments, I’m still deleting all “thanks” and “great post” stuff linked to unusual URIs I’m not familiar with. As usual.

All the best!
Sebastian




You can’t escape from Google-Jail when …

… you've boosted your business Web site's rankings with shitloads of crappy links. The 11th SEO commandment: Don't promote your white hat sites with black hat link building methods! It may work for a while, but once you find your butt in Google-jail, there's no way out. Not even a reconsideration request can help because you can't provide its prerequisites.

When you’re caught eventually -penalized for tons of stinky links- and have to file a reinclusion request, Google wants you to remove all the shady links you’ve spread on the Web before they lift your penalty. Here is an example, well documented in a Google Groups thread started by a penalized site owner with official statements from Matt Cutts and John Müller from Google.

The site in question, a small family business from the UK, has used more or less every tactic from a lazy link builder’s textbook to create 40,000+ inbound links. Sponsored WordPress themes, paid links, comment spam, artificial link exchanges and whatnot.

Most sites carrying these links (for example porn galleries, Web designers, US city guides, obscure oriental blogs, job boards, or cat masturbation guides) are in no way related to the penalized site, which deals with modern teak garden furniture and home furniture sets. (Don't get me wrong. Of course not every link has to be topically related. Every link from a trusted page can pass PageRank, and can improve crawling, indexing, and so on.)

Google has absolutely no problem with unrelated links, unless a site's link profile consists of way too many spammy and/or unrelated links. That does not mean that spreading a gazillion low-life links pointing to a competitor will get this site penalized or even banned. Negative SEO is not that simple. For an innocent site Google just ignores spammy inbound links, but most probably flags it for further investigation, both manually and algorithmically.

If on the other hand Google finds evidence that a site is actively involved in link monkey business of any kind, that’s a completely different story. Such evidence could be massively linking out to spammy places, hosting reciprocal links pages or FFA directories, unskillful (manual|automated) comment spam, signature links and mentions at places that trade links, textual contents made for (paid) link campaigns when reused too often, buying links from trackable services, (link request emails forwarded via) paid-link/spam reports, and so on.

Below is the "how to file a successful reconsideration request when your sins include link spam" advice from Googlers.

Matt Cutts:

The recommendation from your SEO guy led you directly into a pretty high-risk area; I doubt you really want pages like (NSAW) having sponsored links to your furniture site anyway. It’s definitely possible to extricate your site, but I would make an effort to contact the sites with your sponsored links and request that they remove the links, and then do a reconsideration request. Maybe in the text of your reconsideration request, I’d include a pointer to this thread as well.

John Müller:

You may want to consider what you can do to help clean up similar [=spammy] links on other people’s sites. Blogs and newspaper sites such as http://media.www.dailypennsylvanian.com sometimes receive short comments such as “dont agree”, apparently only for a link back to a site. These comments often use keywords from that site instead of a user name, perhaps “tree bench” for a furniture site or “sexy shoes” for a footwear site. If this kind of behavior might have taken place for your site, you may want to work on rectifying it and include some information on it in your reconsideration request. Given your situation, the person considering your reconsideration request might be curious about links like that.

Translation: We’ll ignore your weekly reconsideration requests unless you’ve removed all artificial links pointing to your site. You’re stuck in Google’s dungeon because they’ve thrown away the keys.

I'd guess that for a site that has filed a reinclusion request admitting it was involved in some sort of link monkey business, Google applies a stricter policy than to a site that was attacked with negative SEO methods. When you're caught red-handed, a lame excuse like "I didn't create those links" is not a tactic I could recommend, because Googlers hate it when an applicant lies in a reinclusion request.

Once caught and penalized, the “since when do inbound links count as negative votes” argument doesn’t apply. It’s quite clear that removing the traces (admitted as well as not admitted shady links) is a prerequisite for a penalty lift. And that even though Google has already discounted these links. That’s the same as with penalized doorway pages. Redirecting doorways to legit landing pages doesn’t count, Google wants to see a 410-Gone HTTP response code (or at least a 404) before they un-penalize a site.

I doubt that's common knowledge to folks who promote their white hat sites with black hat methods. Getting links wiped out at places that didn't check the intention of inserted links in the first place is a royal PITA; in other words, it's impossible to get all shady links removed once you find your butt in Google-jail. That's extremely uncomfortable for site owners who fell for questionable forum advice or hired a promotional service (no, I don't call such assclowns SEOs) applying shady marketing methods without a clear and written warning that those are extremely risky, fully explained and signed by the client.

Maybe in some cases Google will un-penalize a great site although not all link spam was wiped out. However, the costs and efforts of preparing a successful reconsideration request are immense, not to speak of the massive loss of traffic and income.

As Barry mentioned, the thread linked above might be interesting for folks keen on an official confirmation that Google -60 penalties exist. I'd say such SERP penalties (aka red & yellow cards) aren't exactly new, and it doesn't matter to which position a site penalized for guideline violations gets downranked. When I've lost a top spot for gaming Google, that's kismet. I'm not interested in figuring out that 20k spammy links get me a -30 penalty, 40k shady links result in a -60 penalty, and 100k unnatural links qualify me for the famous -950 bashing (the numbers are made up, of course). If I'd spam, then I'd just move on, because I'd have already launched enough other projects to compensate for the losses.

PS: While I was typing, Barry Schwartz posted his Google-Jail story at SE Roundtable.




@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!

Matt Cutts asks us: How should Google handle NOINDEX? That's a tough question, worth thinking about twice before you submit a comment to Matt's post. Here is Matt's question, all the background information you need, and my opinion.

What is NOINDEX?

Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in robots meta tags. Putting a NOINDEX value in a page’s robots meta tag or X-Robots-Tag tells search engines that they shall not index the page content, but may follow links provided on the page.
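For reference, this is what both flavors look like in practice. A minimal sketch of a page that may be crawled, and whose links may be followed, but that must not be indexed:

<?php
// indexer directive in the HTTP header (handy for non-HTML resources too)
header('X-Robots-Tag: noindex');
?>
<html>
<head>
<title>Don't index me</title>
<!-- same directive as a robots meta tag; "follow" is the default anyway -->
<meta name="robots" content="noindex, follow" />
</head>
<body>…</body>
</html>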

To get a grip on NOINDEX’s role in the REP please read my Robots Exclusion Protocol summary at SEOmoz. Also, Google experiments with NOINDEX as crawler directive in robots.txt, more on that later.

How major search engines treat NOINDEX

Of course you could read a ton of my pamphlets to extract this information, but Matt’s summary is still accurate and easier to digest:

    [Matt Cutts on August 30, 2006]
  • Google doesn’t show the page in any way.
  • Ask doesn’t show the page in any way.
  • MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.
  • Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.

Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]

Google's experimental support of NOINDEX as a crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion the DISALLOW behavior of NOINDEX in robots.txt is completely wrong, and without any doubt in no way compliant with the Robots Exclusion Protocol.

Matt’s question: How should Google handle NOINDEX in the future?

To simplify Matt's poll, let's assume he's talking about NOINDEX as an indexer directive, regardless of where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

Here are the arguments, or pros and cons, for each variant:

Google should completely drop a NOINDEX’ed page from their search results

Obviously that’s what most Webmasters would prefer:

This is the behavior that we’ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way — in fact, one of the only ways — to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters.

NOINDEX means don’t index, search engines must respect such directives, even when the content isn’t password protected or cloaked away (redirected or hidden for crawlers but not for visitors).

The corner case where Google discovers a link and lists it on its SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn't crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which search engines request quite frequently anyway. Ok, maybe Google's BlitzCrawler™ has to request robots.txt more often then.

Google should show a reference to NOINDEX’ed pages on their SERPs

Search quality and user experience are strong arguments:

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).

Search quality and searchers' user experience are also a strong argument for totally delisting NOINDEX'ed pages, because most Webmasters use this indexer directive to keep stuff that doesn't provide value for searchers out of the search indexes. <polemic>I mean, how much weight do a few Korean sites carry when it comes to decisions that affect the whole Web?</polemic>

If a Webmaster adds a NOINDEX directive by accident, that's easy to spot in the site's stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.

The reference to the URI removal tools is out of context, because these tools remove a URI only for a short period of time, and all removal requests have to be resubmitted every few weeks. NOINDEX on the other hand is a way to keep a URI out of the index as long as this indexer directive is provided.

I'd say the sole argument for listing references to NOINDEX'ed pages that counts is misleading navigational searches. Of course that does not mean Google may ignore the NOINDEX directive and show (with a linked reference) that it knows a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.

Something in between, Google should find a reasonable way to please both Webmasters and searchers

Quoting Matt again:

The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.

Whether or not a site is popular must not impact a search engine's respect for a Webmaster's decision to keep search engines, and their users, out of her realm. That reads like "Hey, Google is popular, so we've got the right to go to Mountain View and pillage the Googleplex, acquiring everything we can steal for the public domain". Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google's idea of (link) popularity should rule the Web. ;)

Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web’s content wasn’t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I’m not bashing DMOZ here, but as a matter of fact it’s not comprehensive enough to serve as indicator for anything, especially not importance and popularity.

I strongly believe that there’s no such thing as a criterion suitable to mark out a two class Web.

My take: Yes, No, Depends

Google could enhance navigational queries -and even “I feel lucky” queries- that lead to a NOINDEX’ed page with a message like “The best matching result for this query was blocked by the site”. I wouldn’t mind if they mention the URI as long as it’s not linked.

In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we've got more REP indexer directives that could play a role in this discussion.

NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site’s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX’ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.

Of course that’s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it’s way better than ignoring NOINDEX in its current meaning.

A fair solution: NOREFERENCE

If I had to make the decision at Google and couldn't live with a best matching search result blocked message, I'd go for a new REP tag:

"NOINDEX, NOREFERENCE" in a robots meta tag (respectively Googlebot meta tag) or X-Robots-Tag forbids search engines from showing a reference on their SERPs. In robots.txt this would look like
NOINDEX: /
NOINDEX: /blog/
NOINDEX: /members/

NOREFERENCE: /
NOREFERENCE: /blog/
NOREFERENCE: /members/

Search engines would crawl these URIs, and follow their links as long as there’s no NOFOLLOW directive either in robots.txt or a page specific instruction.

NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but would allow references on SERPs. Supporting this indexer directive both in robots.txt as well as on the page (respectively in the HTTP header via X-Robots-Tag) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like NOINDEX=NOREFERENCE for robots.txt could tell search engines how they have to treat NOINDEX statements on site level, or even on site-area level.

Even more appealing would be NOINDEX=REFERENCE, because then only the very few Webmasters who would like to see their NOINDEX'ed URIs on Google's SERPs would have to add a directive to their robots.txt at all. Unfortunately, that's not doable for Google unless they can convince three well-known Korean sites to edit their robots.txt. ;)

 

By the way, don’t miss out on my draft asking for REP tag support in robots.txt!

Anyway: Dear Google, please don’t touch NOINDEX! :)




Nofollow still means don’t follow, and how to instruct Google to crawl nofollow’ed links nevertheless

What was meant as a quick test of rel-nofollow once again (inspired by Michelle's post stating that nofollow'ed comment author links result in rankings) turned up some interesting observations:

  • Google uses sneaky JavaScript links (that mask nofollow'ed static links) for discovery crawling, and indexes the link destinations even though there's no hard-coded link on any page on the whole Web.
  • Google doesn't crawl URIs found in nofollow'ed links only.
  • Google most probably doesn't use anchor text outputted client-sided in rankings for the page that carries the JavaScript link.
  • Google most probably doesn't pass anchor text of JavaScript links to the link destination.
  • Google doesn't pass anchor text of (hard-coded) nofollow'ed links to the link destination.

As for my inspiration, I guess not all links in Michelle's test were truly nofollow'ed. However, she's spot-on stating that condomized author links aren't useless, because they bring in traffic and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don't pay too much attention to REL attributes when you spread your links.

As for my quick test explained below, please consider it an inspiration too. It's not a full-blown SEO test, because I've checked one single scenario for a short period of time. However, looking at the results only within 24 hours after uploading the test makes it quite sure that the test isn't influenced by external noise, for example scraped links and such stuff.

On 2008-02-22 06:20:00 I’ve put a new nofollow’ed link onto my sidebar: Zilchish Crap
<a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-a" rel="nofollow"><span id="repstuff-something-b">Zilchish Crap</span></a>
<script type="text/javascript">
handle=document.getElementById('repstuff-something-b');
handle.firstChild.data='Nillified, Nil';
handle=document.getElementById('repstuff-something-a');
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js1';
handle.rel='dofollow';
</script>

(The JavaScript code changes the link’s HREF, REL and anchor text.)

The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow’ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.

Google crawls URIs extracted from somewhat sneaky JavaScript code

20 minutes later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element’s HREF:
66.249.72.5 2008-02-22 06:47:07 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js1

Roughly three hours after this visit, Googlebot fetched a URI provided only in JS code on the test page:
handle=document.getElementById('a1');
handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js2';
handle.rel='dofollow';

From the log:
66.249.72.5 2008-02-22 09:37:11 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js2

So far Google ignored the hidden JavaScript link to /repstuff/something.php?nil=js3 on the test page. Its code doesn’t change a static link, so that makes sense in the context of repeated statements like “Google ignores JavaScript links / treats them like nofollow’ed links” by Google reps.

Of course the JS code above is easy to analyze, but don’t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!

Google indexes pages that have only JavaScript links pointing to them

The next day I checked the search index, and the results are interesting:

rel-nofollow-test search results

The first search result is the content of the URI with the query string parameter ?nil=js1, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI /repstuff/something.php without query string. There’s not a single real link to this URI elsewhere.

The second search result is a post URI where Google recognized the hard-coded anchor text "zilchish crap", but not the JS code that overwrites it with "Nillified, Nil". With the SERP-URI parameter "&filter=0" Google shows more posts that are findable with the search term [zilchish]. (Hey Matt and Brian, here's room for improvement!)

Google doesn’t pass anchor text of nofollow’ed links to the link destination

A search for [zilchish site:sebastians-pamphlets.com] doesn't show the test page, which doesn't carry this term. In other words, so far the anchor text "zilchish crap" of the nofollow'ed sidebar link hasn't impacted the test page's rankings.

Google doesn’t treat anchor text of JavaScript links as textual content

A search for [nillified site:sebastians-pamphlets.com] doesn’t show any URIs that have “nil, nillified” as client sided anchor text on the sidebar, just the test page:

rel-nofollow-test search results

Results, conclusions, speculation

This test wasn't intended to evaluate whether JS-outputted anchor text gets passed to the link destination or not. Unfortunately "nil" and "nillified" appear both in the JS anchor text and on the page, so that's for another post. However, it seems the JS anchor text isn't indexed for the pages carrying the JS code; at least they don't appear in search results for the JS anchor text, so most likely it won't contribute to the link destination's relevancy for "nil" or "nillified" either.

Maybe Google’s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows.

However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I'm quite confident that condomized links still don't pass reputation, nor topical relevance. My test page is unfindable for the nofollow'ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.

How to safely strip a link condom

And what's the actual "news"? Well, say you've got links that you must condomize because they're paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow'ed link server-sided, and change it to a clean link with JavaScript. Google told us for ages that JS links don't count, so that's perfectly in line with Google's guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don't cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client-sided version of the link. Isn't that neat?

Link condoms with juicy taste faking good karma

Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they look plain ugly otherwise.

Here is how you can do this for a complete Web page. This link is nofollow’ed. The JavaScript code below changed its REL value to “dofollow”. When you put this code at the bottom of your pages, it will un-condomize all your nofollow’ed links.
<script type="text/javascript">
if (document.getElementsByTagName) {
var aElements = document.getElementsByTagName("a");
for (var i=0; i<aElements.length; i++) {
var relvalue = aElements[i].rel.toUpperCase();
if (relvalue.indexOf("NOFOLLOW") != -1) {
aElements[i].rel = "dofollow";
}
}
}
</script>

(You'll still find condomized links on this page. That's because the JavaScript routine above changes only links placed above it.)

When you add JavaScript routines like that to your pages, you’ll increase their page loading time. IOW you slow them down. Also, you should add a note to your linking policy to avoid confused advertisers who chase toolbar PageRank.

Updates: Obviously Google distrusts me, how come? Four days after the link discovery the search quality archangel requested the nofollow’ed URI -without query string- possibly to check whether I serve different stuff to bots and people. As if I’d cloak, laughable. (Or an assclown linked the URI without condom.)
Day five: Google’s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren’t interested in client-sided links at all?




Save bandwidth costs: Dynamic pages can support If-Modified-Since too

When search engine crawlers burn way too much of your bandwidth, this post is for you. Crawlers sent out by major search engines (Google, Yahoo and MSN/Live Search) support conditional GETs, which means they don't fetch your pages if those haven't changed since the last crawl.

Of course they must fetch your stuff over and over again for this comparison if your Web server doesn't play nice with Web robots, and with other user agents that can cache your pages and other Web objects like images. The protocol your Web server and the requestors use to handle caching is quite simple, but its implementation can become tricky. Here is how it works:

1st request Feb/10/2008 12:00:00

Googlebot requests /some-page.php from your server. Since Google has just discovered your page, there are no unusual request headers, just a plain GET.

You create the page from a database record which was modified on Feb/09/2008 10:00:00. Your server sends Googlebot the full page (5k) with an HTTP header
Date: Sun, 10 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT

(let's assume your server is located in Greenwich, UK), and the HTTP response code is 200 (OK).

Bandwidth used: 5 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

2nd request Feb/17/2008 12:00:00

Googlebot found interesting links pointing to your page, so it requests /some-page.php again to check for updates. Since Google already knows the resource, Googlebot requests it with an additional HTTP header
If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT

where the date and time is taken from the Last-Modified header you’ve sent in your response to the previous request.

You didn’t change the page’s record in the database, hence there’s no need to send the full page again. Your Web server sends Googlebot just an HTTP header
Date: Sun, 17 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 09 Feb 2008 10:00:00 GMT

The HTTP response code is 304 (Not Modified). (Your Web server can suppress the Last-Modified header, because the requestor has this timestamp already.)

Bandwidth used: Less than 500 bytes for the HTTP header.

3rd request Feb/24/2008 12:00:00

Googlebot can't resist recrawling /some-page.php, again using the
If-Modified-Since: Sat, 09 Feb 2008 10:00:00 GMT

header.

You’ve updated the database on Feb/23/2008 09:00:00 adding a few paragraphs to the article, thus you send Googlebot the full page (now 7k) with this HTTP header
Date: Sun, 24 Feb 2008 12:00:00 GMT
Last-Modified: Sat, 23 Feb 2008 09:00:00 GMT

and an HTTP response code 200 (OK).

Bandwidth used: 7 kilobytes for the page contents plus less than 500 bytes for the HTTP header.

Further requests

Provided you don’t change the contents again, all further chats between Googlebot and your Web server regarding /some-page.php will burn less than 500 bytes of your bandwidth each. Say Googlebot requests this page weekly, that’s 370k saved bandwidth annually. You do the math. Even with a medium-sized Web site you most likely want to implement proper caching, right?
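To spell out that math for a single page: without conditional GETs, 52 weekly fetches of the 7k page plus headers cost roughly 52 × 7.5k ≈ 390k per year; with them, most fetches shrink to the ~500 byte header exchange, about 52 × 0.5k ≈ 26k per year plus the occasional full fetch after a real change. The difference is the roughly 370k mentioned above, per page and per crawler; multiply it by the number of crawlable pages and by every crawler that hits them.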

Not only Webmasters love conditional GET requests that save bandwidth costs and processing time; search engines aren't keen on useless data transfers either. So let's see how you could respond efficiently to conditional GET requests from search engines. Apache handles caching of static files (e.g. .txt or .html files you upload with FTP) differently from dynamic contents (script outputs with or without a query string in the URI).

Static files

Fortunately, Apache comes with native support of the Last-Modified / If-Modified-Since / Not-Modified functionality. That means that crawlers and your Web server don't produce too much network traffic when a requested static file hasn't changed since the last crawl.

You can test your Web server's conditional GET support with your robots.txt, or, if even your robots.txt is a script, create a tiny HTML page with a text editor and upload it via FTP. Another neat tool to check HTTP headers is the Live Headers extension for Firefox (bear in mind that testing crawler behavior with Web browsers is fault-prone by design).

If your second request of an unchanged static file results in a 200 HTTP response code instead of a 304, call your hosting service. If it works and you've got only static pages, then bookmark this article and move on.
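If you'd rather test from the command line than with a browser extension, a tiny script can do the two-request check for you. A rough sketch using PHP's HTTP stream wrapper (the URI is a placeholder): it fetches a URI once, then repeats the request with an If-Modified-Since header built from the first response's Last-Modified.

// quick conditional-GET check: request a URI twice, the second time
// with If-Modified-Since taken from the first response's Last-Modified
$uri = 'http://example.com/robots.txt'; // placeholder, use a static file of yours
function fetchHeaders($uri, $extraHeader = '') {
$context = stream_context_create(array('http' => array(
'method' => 'GET',
'header' => $extraHeader,
'ignore_errors' => TRUE // don't bail out on a 304
)));
@file_get_contents($uri, FALSE, $context);
return $http_response_header; // status line + headers of this request
}
$headers = fetchHeaders($uri);
echo "1st request: " . $headers[0] . "\n";
$lastModified = '';
foreach ($headers as $header) {
if (stripos($header, 'Last-Modified:') === 0) {
$lastModified = trim(substr($header, 14));
}
}
if ($lastModified == '') {
die("No Last-Modified header - conditional GETs can't work here.\n");
}
$headers = fetchHeaders($uri, "If-Modified-Since: $lastModified\r\n");
echo "2nd request: " . $headers[0] . "\n"; // should be a 304 for an unchanged file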

Dynamic contents

Everything you output with server-sided scripts is dynamic content by definition, regardless of whether the URI has a query string or not. Even if you just read and print out a static file (that never changes) with PHP, Apache doesn't add the Last-Modified header that would prompt crawlers to perform further requests with an If-Modified-Since header.

With dynamic content you can’t rely on Apache’s caching support, you must do it yourself.

The first step is figuring out where your CMS or eCommerce software hides the timestamps telling you the date and time of a page's last modification. Usually a script pulls its stuff from different database tables, hence a page contains more than one area, or block, of dynamic contents. Every block might have a different last-modified timestamp, but not every block is important enough to serve as the page's determinant last-modified date. The same goes for templates. Most template tweaks shouldn't trigger a full-blown recrawl, but some do, for example a new address or phone number if such information is present on every page.

For example, a blog has posts, pages, comments, categories and other data sources that can change the sidebar's contents quite frequently. On a page that outputs a single post or page, the last-modified date is determined by the post or its latest comment, whichever is newer. The main page's last-modified date is the modified-timestamp of the most recent post, and the same goes for its paginated continuations. A category page's last-modified date is determined by the category's most recent post, and so on.

New posts can change outgoing links of older posts when you use plugins that list related posts and stuff like that. There are many more reasons why search engines should crawl older posts at least monthly or so. You might need a routine that bumps a blog page's last-modified timestamp when, for example, it's more than 30 days or so in the past. Also, in some cases it could make sense to have a routine that can reset all timestamps reported as last-modified date for particular site areas, or even the whole site.

If your software doesn't populate last-modified attributes on changes of all entities, then snap at the chance to consider database triggers, stored procedures, or changes to your data access layer. Bear in mind that not every change of a record must trigger a crawler cache reset. For example, a table storing textual contents like articles or product descriptions usually has a number of attributes that don't affect crawling, thus it should have a last updated attribute that's changeable in the UI and serves as last-modified date in your crawler cache control (instead of the timestamp that's changed automatically even on minor updates of attributes which are meaningless for HTML outputs).

Handling Last-Modified, If-Modified-Since, and Not-Modified HTTP headers with PHP/Apache

Below I provide example PHP code I've thrown together after midnight in a sleepless night, doped with painkillers. It doesn't run on a production system, but it should get you started. Adapt it to your needs and make sure you test your stuff intensively. As always, my stuff comes as-is, without any guarantees. ;)

First grab a couple of helpers and put them in an include file that's available in all your scripts. Since we deal with HTTP headers, you must not output anything before the logic that deals with conditional search engine requests, not even a single whitespace character or HTML DOCTYPE declaration …
View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)
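If you can't grab the include, the helpers can be reconstructed from how they're used below. Here's a rough sketch (not the exact code from the include; the 13-hour offset just matches the example further down, adapt as needed):

// stand-ins for the helper include - reconstructed from their usage below

// returns the If-Modified-Since request header as a Unix timestamp,
// or FALSE if the request isn't conditional
function getIfModifiedSince() {
if (empty($_SERVER['HTTP_IF_MODIFIED_SINCE'])) return FALSE;
// some clients append attributes like "; length=1234"
$httpDate = preg_replace('/;.*$/', '', $_SERVER['HTTP_IF_MODIFIED_SINCE']);
$timestamp = @strtotime($httpDate);
return ($timestamp === FALSE || $timestamp === -1) ? FALSE : $timestamp;
}

// Unix timestamp -> HTTP-date (RFC 1123, always GMT)
function unixTimestamp2HttpDate($timestamp) {
return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}

// MySQL DATETIME ("2008-02-09 10:00:00") -> Unix timestamp
function date2UnixTimestamp($mysqlDatetime) {
return @strtotime($mysqlDatetime);
}

// Unix timestamp -> MySQL DATETIME
function unixTimestamp2MySqlDatetime($timestamp) {
return date('Y-m-d H:i:s', $timestamp);
}

// report a last-modified date that's a little fresher than the real one
// to cover clock drift between your server and the crawlers (13 hours here)
function makeLastModifiedTimestamp($lastModified) {
return $lastModified + (13 * 60 * 60);
}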

In general, all user agents should support conditional GET requests, not only search engine crawlers. If you allow long-lasting caching, which is fine with search engines that don't need to crawl your latest Twitter message from your blog's sidebar, you could leave your visitors with somewhat outdated pages if you serve them 304-Not-Modified responses too.

It might be a good idea to limit 304 responses to conditional GET requests from crawlers when you don't implement way shorter caching cycles for other user agents. The latter include folks that spoof their user agent name, as well as scrapers trying to steal your stuff masked as a legit spider. To verify legit search engine crawlers that (should) support conditional GET requests (from Google, Yahoo, MSN and Ask) you can grab my crawler detection routines here. Include them as well, then you can code stuff like this:

$isSpiderUA = checkCrawlerUA ();
$isLegitSpider = checkCrawlerIP (__FILE__);
if ($isSpiderUA && !$isLegitSpider) {
@header("Thou shalt not spoof", TRUE, 403);
exit;
// make sure your 403-Forbidden ErrorDocument directive in
// .htaccess points to a page that explains the issue!
}
if ($isLegitSpider) {
// insert your code dealing with conditional GET requests
}

Now that you’re sure that the requestor is a legit crawler from a major search engine, look at the HTTP request header it has submitted to your Web server.

// lookup the HTTP request header for a possible conditional GET
$ifModifiedSinceTimestamp = getIfModifiedSince();
// if the request is not conditional, don't send a 304
$canSend304 = FALSE;
if ($ifModifiedSinceTimestamp !== FALSE) {
$canSend304 = TRUE;

// Tells the requestor that you've recognized the conditional GET
$echoRequestHeader = "X-Requested-If-modified-since: "
.unixTimestamp2HttpDate($ifModifiedSinceTimestamp);
@header($echoRequestHeader, TRUE);
}

You don’t need to echo the If-Modified-Since HTTP-date in the response header, but this custom header makes testing easier.

Next get the page’s actual last-modified date/time. Here is an (incomplete) code sample for a WordPress single post page.

// select the requested post's comment_count, post_modified and
 // post_date values, then:
if ($wp_post_modified) {
$lastModified = date2UnixTimestamp($wp_post_modified);
}
else {
$lastModified = date2UnixTimestamp($wp_post_date);
}
if (intval($wp_comment_count) > 0) {
// select last comment from the WordPress database, then:
$lastCommentTimestamp = date2UnixTimestamp($wp_comment_date);
if ($lastCommentTimestamp > $lastModified) {
$lastModified = $lastCommentTimestamp;
}
}

The date2UnixTimestamp() function accepts MySQL datetime values as valid input. If you need to (re)write last-modified dates to a MySQL database, convert the Unix timestamps to MySQL datetime values with unixTimestamp2MySqlDatetime().

Your server's clock isn't necessarily synchronized with all search engines out there. To cover possible gaps you can use a last-modified timestamp that's a little bit fresher than the actual last-modified date. In this example the timestamp reported to the crawler is last-modified + 13 hours; you can change the offset in makeLastModifiedTimestamp().
$lastModifiedTimestamp = makeLastModifiedTimestamp($lastModified);

When you compare the timestamps later on and the request isn't conditional, make sure you don't run into the 304 routine.
if ($ifModifiedSinceTimestamp === FALSE) {
// make things equal if the request isn't conditional
$ifModifiedSinceTimestamp = $lastModifiedTimestamp;
}

You may want to allow a full fetch if the requestor’s timestamp is ancient, in this example older than one month.
$tooOld = @strtotime("now") - (31 * 24 * 60 * 60);
if ($ifModifiedSinceTimestamp < $tooOld) {
$lastModifiedTimestamp = @strtotime("now");
$ifModifiedSinceTimestamp = @strtotime("now") - (1 * 24 * 60 * 60);
}

Setting the last-modified attribute to yesterday schedules the next full crawl after this fetch in 30 days (or later, depending on the actual crawl frequency).

Finally, respond with 304-Not-Modified if the page wasn't significantly changed since the date/time given in the crawler's If-Modified-Since header. Otherwise send a Last-Modified header with a 200 HTTP response code, allowing the crawler to fetch the page contents.
$lastModifiedHeader = "Last-Modified: " .unixTimestamp2HttpDate($lastModifiedTimestamp);
if ($lastModifiedTimestamp < $ifModifiedSinceTimestamp &&
$canSend304) {
@header($lastModifiedHeader, TRUE, 304);
exit;
}
else {
@header($lastModifiedHeader, TRUE);
}

When you're testing your version of this script with a browser, it will send a standard HTTP request, and your server will return a 200-OK. From your server's response your browser should recognize the "Last-Modified" header, so when you reload the page the browser should send an "If-Modified-Since" header, and you should get the 304 response code if Last-Modified is older than If-Modified-Since. However, judging from my experience, such browser-based tests of crawler behavior, respectively of responses to crawler requests, aren't reliable.

Test it with this MS tool instead. I’ve played with it for a while and it works great. With the PHP code above I’ve created a 200/304 test page
http://sebastians-pamphlets.com/tools/last-modified-yesterday.php
that sends a “Last-Modified: Yesterday” response header, and should return a 304-Not Modified HTTP response code when you request it with an “If-Modified-Since: Today+” header, otherwise it should respond with 200-OK (this version returns 200-OK only but tells when it would  respond with a 304). You can use this URI with the MS-tool linked above to test HTTP requests with different If-Modified-Since headers.

Have fun and paypal me 50% of your savings. ;)




Update your crawler detection: MSN/Live Search announces msnbot/1.1

Fabrice Canel from Live Search announces significant improvements to their crawler today. The much-appreciated changes are:

HTTP compression

The revised msnbot supports gzip and deflate as defined by RFC 2616 (sections 14.11 and 14.39). Microsoft also provides a tool to check your server's compression / conditional GET support. (Bear in mind that most dynamic pages (blogs, forums, …) will fool such tools; try it with a static page or your robots.txt.)

No more crawling of unchanged contents

The new msnbot/1.1 will not fetch pages that haven't changed since the last request, as long as the Web server supports the "If-Modified-Since" header in conditional GET requests. If a page hasn't changed since the last crawl, the server responds with 304 and the crawler moves on. In this case your Web server exchanges only a handful of short lines of text with the crawler, not the contents of the requested resource.

If your server isn't configured for HTTP compression and conditional GETs, you really should request that from your hosting service for the sake of your bandwidth bills.
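If your host won't enable mod_deflate or mod_gzip for you, PHP can at least compress your script outputs itself. A minimal fallback sketch; put it at the very top of your scripts, before any output (it covers compression only, conditional GETs for dynamic pages are handled in the If-Modified-Since pamphlet above):

// compress script output for clients (and crawlers) that send an
// Accept-Encoding header, unless zlib already does it for us
if (!ini_get('zlib.output_compression')) {
ob_start('ob_gzhandler');
}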

New user agent name

From reading server log files we know the Live Search bot as "msnbot/1.0 (+http://search.msn.com/msnbot.htm)", or "msnbot-media/1.0", "msnbot-products/1.0", and "msnbot-news/1.0". From now on you'll see "msnbot/1.1". Nathan Buggia from Live Search clarifies: "This update does not apply to all the other 'msnbot-*' crawlers, just the main msnbot. We will be updating those bots in the future".

If you just check the user agent string for "msnbot" you've got nothing to change; otherwise you should check the user agent string for both "msnbot/1.0" as well as "msnbot/1.1" before you do the reverse DNS lookup to identify bogus bots. MSN will not change the host name ".search.live.com" used by the crawling engine.
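A rough sketch of such a check (loose UA match plus forward/reverse DNS verification against the ".search.live.com" host name); adapt it to your own crawler detection:

// verify that a request claiming to be msnbot really comes from Live Search
function isLegitMsnbot() {
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'msnbot') === FALSE) return FALSE; // covers 1.0 and 1.1
$ip = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip); // reverse DNS
if (!preg_match('/\.search\.live\.com$/i', $host)) return FALSE;
return (gethostbyname($host) == $ip); // forward DNS must match
}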

The announcement didn't tell us whether the new bot will utilize HTTP/1.1 or not (MS and Yahoo crawlers, like other Web robots, still perform, or rather fake, HTTP/1.0 requests).

It looks like it’s no longer necessary to charge Live Search for bandwidth their crawler has burned. ;) Jokes aside, instead of reporting crawler issues to [email protected], you can post your questions or concerns at a forum dedicated to MSN crawler feedback and discussions.

I’m quite nosy, so I just had to investigate what “there are many more improvements” in the blog post meant. I’ve asked Nathan Buggia from Microsoft a few questions.

Nate, thanks for the opportunity to talk crawling  with you. Can you please reveal a few msnbot/1.1 secrets? ;)

I’m glad you’re interested in our update, but we’re not yet ready to provide more details about additional improvements. However, there are several more that we’ll be shipping in the next couple months.

Fair enough. So let's talk about related topics.

Currently I can set crawler directives for file types identified by their extensions in my robots.txt’s msnbot section. Will you fully support wildcards (* and $ for all URI components, that is path and query string) in robots.txt in the foreseeable future?

This is one of several additional improvements that we are looking at today, however it has not been released in the current version of MSNBot. In this update we were squarely focused on reducing the burden of MSNBot on your site.

What can or should a Webmaster do when you seem to crawl a site way too fast, or not fast enough? Do you plan to provide a tool to reduce the server load, respectively speed up your crawling for particular sites?

We currently support the “crawl-delay” option in the robots.txt file for webmasters that would like to slow down our crawling. We do not currently support an option to increase crawling frequency, but that is also a feature we are considering.

Will msnbot/1.1 extract URLs from client sided scripts for discovery crawling? If so, will such links pass reputation?

Currently we do not extract URLs from client-side scripts.

Google’s last change of their infrastructure made nofollow’ed links completely worthless, because they no longer used those in their discovery crawling. Did you change your handling of links with a “nofollow” value in the REL attribute with this upgrade too?

No, changes to how we process nofollow links were not part of this update.

Nate, many thanks for your time and your interesting answers!

    Related posts:

  • Official announcement - by Nathan Buggia, Live Search Webmaster Center Blog
  • MSNbot 1.1: Live Search Implements A More Efficient Crawl - by Vanessa Fox, Search Engine Land


