SEO

Archived posts from the 'SEO' Category

About the bad taste of shameless ego food

Posted on 30 January, 2009

Seems I’ve made it on the short-list at the SEMMY 2009 Awards in the Search Tech category. Great ego food. I’m honored. Thanks for nominating me that often! And thanks to John Andrews and Todd Mintz for the kind judgement!

Now that you’ve read the longish introduction, why not click here and vote for my pamphlet?

Ok Ok Ok, it’s somewhat technical and perhaps you even consider it plain geek food. Nevertheless, it’s hopefully useful for your daily work. BTW … I wish more search engine engineers would read it. It could help them to tidy up their flawed REP support.

Does this post smell way too selfish? I guess it does, but I’ll post it nonetheless coz I’m fucking keen on your votes. Thanks in advance!

Wow, I won! Thank you all!

2 comments Sebastian | X-Robots-Tag, Ego Food, URL removal, Robots Meta Tags, robots.txt, SEO

Crawling vs. Indexing

Posted on 20 October, 2008

Sigh. I just have to throw in my 2 cents.

Crawling means sucking content without processing the results. Crawlers are rather dumb processes that fetch content supplied by Web servers answering (HTTP) requests for URIs, and deliver those contents to other processes, e.g. crawling caches or directly to indexers. Crawlers get their URIs from a crawling engine that’s fed from different sources, including links extracted from previously crawled Web documents, URI submissions, foreign Web indexes, and whatnot.

Indexing means making sense out of the retrieved contents, storing the processing results in a (more or less complex) document index. Link analysis is a way to measure URI importance, popularity, trustworthiness and so on. Link analysis is often just a helper within the indexing process, sometimes the end in itself, but traditionally a task of the indexer, not the crawler (highly sophisticated crawling engines do use link data to steer their crawlers, but that has nothing to do with link analysis in document indexes).

A crawler directive like “disallow” in robots.txt can direct crawlers, but means nothing to indexers.

An indexer directive like “noindex” in an HTTP header, an HTML document’s HEAD section, or even a robots.txt file, can direct indexers, but means nothing to crawlers, because the crawlers have to fetch the document in order to enable the indexer to obey those (inline) directives.
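
The split between crawler directives and indexer directives can be sketched in a few lines of JavaScript. This is a deliberately simplified model, not how any real search engine works: the function names, the prefix-only Disallow matching, and the shape of the fetched document are all assumptions made for illustration.

```javascript
// Hypothetical, simplified model; all names and data shapes are illustrative.

// Crawler side: may this path be fetched at all? Uses plain prefix matching
// on Disallow values (no wildcards, no Allow lines).
function mayCrawl(disallowedPrefixes, path) {
  return !disallowedPrefixes.some(
    (prefix) => prefix !== "" && path.startsWith(prefix)
  );
}

// Indexer side: may a fetched document be indexed? Looks only at directives
// delivered with the document (robots meta tag or HTTP header).
function mayIndex(robotsDirectives) {
  return !robotsDirectives
    .map((directive) => directive.trim().toLowerCase())
    .includes("noindex");
}

// The pipeline: the indexer only sees what the crawler was allowed to fetch.
function processUri(path, disallowedPrefixes, fetchDocument) {
  if (!mayCrawl(disallowedPrefixes, path)) {
    return "not crawled"; // inline directives on the page are never seen
  }
  const doc = fetchDocument(path);
  return mayIndex(doc.robotsDirectives)
    ? "crawled and indexed"
    : "crawled, not indexed";
}
```

Note that a disallowed URI never reaches mayIndex(), which is why an inline “noindex” can only work on pages the crawler is allowed to fetch.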

So when a Web service offers an indexer directive like

    <meta name="SEOservice" content="noindex" />

to keep particular content out of its index, but doesn’t offer a crawler directive like

    User-agent: SEOservice
    Disallow: /

this Web service doesn’t crawl.

That’s not about semantics, that’s about Web standards.

Whether or not such a Web service can come with incredible values squeezed out of its index gathered elsewhere, without crawling the Web itself, is a completely different story.

15 comments Sebastian | X-Robots-Tag, Robots Meta Tags, Crawler Directives, robots.txt, SEO

You can’t escape from Google-Jail when …

Posted on 27 February, 2008

… you’ve boosted your business Web site’s rankings with shitloads of crappy links. The 11th SEO commandment: Don’t promote your white hat sites with black hat link building methods! It may work for a while, but once you find your butt in Google-jail, there’s no way out. Not even a reconsideration request can help because you can’t provide its prerequisites.

When you’re caught eventually -penalized for tons of stinky links- and have to file a reinclusion request, Google wants you to remove all the shady links you’ve spread on the Web before they lift your penalty. Here is an example, well documented in a Google Groups thread started by a penalized site owner with official statements from Matt Cutts and John Müller from Google.

The site in question, a small family business from the UK, has used more or less every tactic from a lazy link builder’s textbook to create 40,000+ inbound links. Sponsored WordPress themes, paid links, comment spam, artificial link exchanges and whatnot.

Most sites that carry these links, for example porn galleries, Web designers, US city guides, obscure oriental blogs, job boards, or cat masturbation guides, are in no way related to the penalized site, which deals with modern teak garden furniture and home furniture sets. (Don’t get me wrong. Of course not every link has to be topically related. Every link from a trusted page can pass PageRank, and can improve crawling, indexing, and so on.)

Google has absolutely no problem with unrelated links, unless a site’s link profile consists of way too many spammy and/or unrelated links. That does not mean that spreading a gazillion low-life links pointing to a competitor will get this site penalized or even banned. Negative SEO is not that simple. For an innocent site Google just ignores spammy inbound links, but most probably flags it for further investigations, both manually as well as algorithmically.

If on the other hand Google finds evidence that a site is actively involved in link monkey business of any kind, that’s a completely different story. Such evidence could be massively linking out to spammy places, hosting reciprocal links pages or FFA directories, unskillful (manual|automated) comment spam, signature links and mentions at places that trade links, textual contents made for (paid) link campaigns when reused too often, buying links from trackable services, (link request emails forwarded via) paid-link/spam reports, and so on.

Below is the “how to file a successful reconsideration request when your sins include link spam” from Googlers.

Matt Cutts:

The recommendation from your SEO guy led you directly into a pretty high-risk area; I doubt you really want pages like (NSAW) having sponsored links to your furniture site anyway. It’s definitely possible to extricate your site, but I would make an effort to contact the sites with your sponsored links and request that they remove the links, and then do a reconsideration request. Maybe in the text of your reconsideration request, I’d include a pointer to this thread as well.

John Müller:

You may want to consider what you can do to help clean up similar [=spammy] links on other people’s sites. Blogs and newspaper sites such as http://media.www.dailypennsylvanian.com sometimes receive short comments such as “dont agree”, apparently only for a link back to a site. These comments often use keywords from that site instead of a user name, perhaps “tree bench” for a furniture site or “sexy shoes” for a footwear site. If this kind of behavior might have taken place for your site, you may want to work on rectifying it and include some information on it in your reconsideration request. Given your situation, the person considering your reconsideration request might be curious about links like that.

Translation: We’ll ignore your weekly reconsideration requests unless you’ve removed all artificial links pointing to your site. You’re stuck in Google’s dungeon because they’ve thrown away the keys.

I’d guess that for a site that has filed a reinclusion request stating the site was involved in some sort of link monkey business, Google applies a more strict policy than with a site that was attacked by negative SEO methods. I highly doubt that when caught red-handed a lame excuse like “I didn’t create those links” is a tactic I could recommend, because Googlers hate it when an applicant lies in a reinclusion request.

Once caught and penalized, the “since when do inbound links count as negative votes” argument doesn’t apply. It’s quite clear that removing the traces (admitted as well as not admitted shady links) is a prerequisite for a penalty lift. And that even though Google has already discounted these links. That’s the same as with penalized doorway pages. Redirecting doorways to legit landing pages doesn’t count, Google wants to see a 410-Gone HTTP response code (or at least a 404) before they un-penalize a site.

I doubt that’s common knowledge to folks who promote their white hat sites with black hat methods. Getting links wiped out at places that didn’t check the intention of inserted links in the first place is a royal PITA, in other words, it’s impossible to get all shady links removed once you find your butt in Google-jail. That’s extremely uncomfortable for site owners who fell for questionable forum advice or hired a promotional service (no, I don’t call such assclowns SEOs) applying shady marketing methods without a clear and written warning that those are extremely risky, fully explained and signed by the client.

Maybe in some cases Google will un-penalize a great site although not all link spam was wiped out. However, the costs and efforts of preparing a successful reconsideration request are immense, not to speak of the massive loss of traffic and income.

As Barry mentioned, the thread linked above might be interesting for folks keen on an official confirmation that Google -60 penalties exist. I’d say such SERP penalties (aka red & yellow cards) aren’t exactly new, and it plays no role to which position a site penalized for guideline violations gets downranked. When I’ve lost a top spot for gaming Google, that’s kismet. I’m not interested in figuring out that 20k spammy links get me a -30 penalty, 40k shady links result in a -60 penalty, and 100k unnatural links qualify me for the famous -950 bashing (the numbers are made up of course). If I’d spam, then I’d just move on because I’d have already launched enough other projects to compensate the losses.

PS: While I was typing, Barry Schwartz posted his Google-Jail story at SE Roundtable.

45 comments Sebastian | Reciprocal Links, Webspam, Risky Linkage, Spam Report, SEO, Paid Links, Google

@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!

Posted on 25 February, 2008

Matt Cutts asks us: How should Google handle NOINDEX? That’s a tough question worth thinking about twice before you submit a comment to Matt’s post. Here is Matt’s question, all the background information you need, and my opinion.

What is NOINDEX?

Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in robots meta tags. Putting a NOINDEX value in a page’s robots meta tag or X-Robots-Tag tells search engines that they shall not index the page content, but may follow links provided on the page.
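
As a minimal sketch of what that means for an indexer, here is how the content value of a robots meta tag (or the value of an X-Robots-Tag header) could be interpreted. The function name is made up for this example, and real search engines support more values than the three handled here.

```javascript
// Sketch: interpret a value like "noindex, follow" from a robots meta tag
// or an X-Robots-Tag header. Hypothetical helper, not a search engine API.
function interpretRobotsDirectives(value) {
  const tokens = value
    .split(",")
    .map((token) => token.trim().toLowerCase())
    .filter((token) => token.length > 0);
  const has = (name) => tokens.includes(name);
  return {
    // "none" is shorthand for "noindex, nofollow"
    index: !(has("noindex") || has("none")),
    follow: !(has("nofollow") || has("none")),
  };
}
```

With this model a bare “NOINDEX” yields index = false and follow = true: the page stays out of the index, but its links may still be followed.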

To get a grip on NOINDEX’s role in the REP please read my Robots Exclusion Protocol summary at SEOmoz. Also, Google experiments with NOINDEX as crawler directive in robots.txt, more on that later.

How major search engines treat NOINDEX

Of course you could read a ton of my pamphlets to extract this information, but Matt’s summary is still accurate and easier to digest:

[Matt Cutts on August 30, 2006]
Google doesn’t show the page in any way.

Ask doesn’t show the page in any way.

MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.

Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.

Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]

Google’s experimental support of NOINDEX as crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion the DISALLOW behavior of NOINDEX in robots.txt is completely wrong, and without any doubt in no way compliant with the Robots Exclusion Protocol.

Matt’s question: How should Google handle NOINDEX in the future?

To simplify Matt’s poll, let’s assume he’s talking about NOINDEX as indexer directive, regardless of where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

Here are the arguments, or pros and cons, for each variant:

Google should completely drop a NOINDEX’ed page from their search results

Obviously that’s what most Webmasters would prefer:

This is the behavior that we’ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way — in fact, one of the only ways — to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters.

NOINDEX means don’t index, search engines must respect such directives, even when the content isn’t password protected or cloaked away (redirected or hidden for crawlers but not for visitors).

The corner case that Google discovers a link and lists it on their SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn’t crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which is requested by search engines quite frequently. Ok, maybe Google’s BlitzCrawler™ has to request robots.txt more often then.

Google should show a reference to NOINDEX’ed pages on their SERPs

Search quality and user experience are strong arguments:

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).

Search quality and searchers’ user experience is also a strong argument for totally delisting NOINDEX’ed pages, because most Webmasters use this indexer directive to keep stuff that doesn’t provide value for searchers out of the search indexes. <polemic>I mean, how much weight have a few Korean sites when it comes to decisions that affect the whole Web?</polemic>

If a Webmaster puts a NOINDEX directive by accident, that’s easy to spot in the site’s stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.

The reference to the URI removal tools is out of context, because these tools remove a URI only for a short period of time and all removal requests have to be resubmitted every few weeks. NOINDEX on the other hand is a way to keep a URI out of the index as long as this indexer directive is provided.

I’d say the sole argument for listing references to NOINDEX’ed pages that counts is misleading navigational searches. Of course that does not mean that Google may ignore the NOINDEX directive to show -with a linked reference- that they know a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.

Something in between, Google should find a reasonable way to please both Webmasters and searchers

Quoting Matt again:

The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.

Whether or not a site is popular must not impact a search engine’s respect for a Webmaster’s decision to keep search engines, and their users, out of her realm. That reads like “Hey, Google is popular, so we’ve the right to go to Mountain View to pillage the Googleplex, acquiring everything we can steal for the public domain”. Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google’s idea of (link) popularity should rule the Web.

Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web’s content wasn’t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I’m not bashing DMOZ here, but as a matter of fact it’s not comprehensive enough to serve as indicator for anything, especially not importance and popularity.

I strongly believe that there’s no such thing as a criterion suitable to mark out a two class Web.

My take: Yes, No, Depends

Google could enhance navigational queries -and even “I feel lucky” queries- that lead to a NOINDEX’ed page with a message like “The best matching result for this query was blocked by the site”. I wouldn’t mind if they mention the URI as long as it’s not linked.

In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we’ve more REP indexer directives that could play a role in this discussion.

NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site’s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX’ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.

Of course that’s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it’s way better than ignoring NOINDEX in its current meaning.

A fair solution: NOREFERENCE

If I’d make the decision at Google and couldn’t live with a best matching search result blocked message, I’d go for a new REP tag:

“NOINDEX, NOREFERENCE” in a robots meta tag -respectively Googlebot meta tag- or X-Robots-Tag forbids search engines to show a reference on their SERPs. In robots.txt this would look like

    NOINDEX: /
    NOINDEX: /blog/
    NOINDEX: /members/
    …
    NOREFERENCE: /
    NOREFERENCE: /blog/
    NOREFERENCE: /members/
    …

Search engines would crawl these URIs, and follow their links as long as there’s no NOFOLLOW directive either in robots.txt or a page specific instruction.

NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but allows references on SERPs. Supporting this indexer directive both in robots.txt as well as on-the-page (respectively in the HTTP header for X-Robots-Tags) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like NOINDEX=NOREFERENCE for robots.txt could tell search engines how they have to treat NOINDEX statements on site level, or even on site area level.
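
To make the proposal more tangible, here is a rough sketch of how a search engine could evaluate it. Keep in mind that NOREFERENCE exists only as a suggestion in this post; no search engine is known to support it, and the function names, the prefix matching and the three result labels are assumptions made for the example.

```javascript
// Sketch of the PROPOSED (not existing) robots.txt syntax from this post.
// All names are illustrative; matching is a plain prefix comparison.
function parseProposedDirectives(robotsTxt) {
  const rules = { noindex: [], noreference: [] };
  for (const rawLine of robotsTxt.split("\n")) {
    const match = /^\s*(NOINDEX|NOREFERENCE)\s*:\s*(\S+)/i.exec(rawLine);
    if (match) {
      rules[match[1].toLowerCase()].push(match[2]);
    }
  }
  return rules;
}

// How a SERP might treat a path under the proposal:
// "indexed", "reference only" (NOINDEX alone), or "hidden" (NOINDEX + NOREFERENCE).
function serpTreatment(rules, path) {
  const matches = (prefixes) => prefixes.some((prefix) => path.startsWith(prefix));
  if (!matches(rules.noindex)) return "indexed";
  return matches(rules.noreference) ? "hidden" : "reference only";
}
```

In this model a site with “NOINDEX: /blog/” but no matching NOREFERENCE line would still get a bare reference on the SERPs, while “NOINDEX: /members/” plus “NOREFERENCE: /members/” would hide the members area completely.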

Even more appealing would be NOINDEX=REFERENCE, because only the very few Webmasters that would like to see their NOINDEX’ed URIs on Google’s SERPs would have to add a directive to their robots.txt at all. Unfortunately, that’s not doable for Google unless they can convince three well known Korean sites to edit their robots.txt.

By the way, don’t miss out on my draft asking for REP tag support in robots.txt!

Anyway: Dear Google, please don’t touch NOINDEX!

6 comments Sebastian | URL removal, Search Quality, X-Robots-Tag, Robots Meta Tags, Crawler Directives, SEO, robots.txt, Google

Nofollow still means don’t follow, and how to instruct Google to crawl nofollow’ed links nevertheless

Posted on 23 February, 2008

What was meant as a quick test of rel-nofollow once again (inspired by Michelle’s post stating that nofollow’ed comment author links result in rankings) led to some interesting observations:

Google uses sneaky JavaScript links (that mask nofollow’ed static links) for discovery crawling, and indexes the link destinations even though there’s no hard coded link on any page on the whole Web.
Google doesn’t crawl URIs found in nofollow’ed links only.
Google most probably doesn’t use anchor text outputted client sided in rankings for the page that carries the JavaScript link.
Google most probably doesn’t pass anchor text of JavaScript links to the link destination.
Google doesn’t pass anchor text of (hard coded) nofollow’ed links to the link destination.

As for my inspiration, I guess not all links in Michelle’s test were truly nofollow’ed. However, she’s spot on stating that condomized author links aren’t useless because they bring in traffic, and can result in clean links when a reader copies the URI from the comment author link and drops it elsewhere. Don’t pay too much attention to REL attributes when you spread your links.

As for my quick test explained below, please consider it an inspiration too. It’s not a full blown SEO test, because I’ve checked one single scenario for a short period of time. However, looking only at results gathered within 24 hours after uploading the test makes it fairly certain that the test isn’t influenced by external noise, for example scraped links and such stuff.

On 2008-02-22 06:20:00 I put a new nofollow’ed link onto my sidebar: Zilchish Crap

    <a href="http://sebastians-pamphlets.com/repstuff/something.php" id="repstuff-something-a" rel="nofollow"><span id="repstuff-something-b">Zilchish Crap</span></a>
    <script type="text/javascript">
    // Overwrite the anchor text that was output by the server
    var handle = document.getElementById('repstuff-something-b');
    handle.firstChild.data = 'Nillified, Nil';
    // Swap the link's HREF and REL on the client side
    handle = document.getElementById('repstuff-something-a');
    handle.href = 'http://sebastians-pamphlets.com/repstuff/something.php?nil=js1';
    handle.rel = 'dofollow';
    </script>

(The JavaScript code changes the link’s HREF, REL and anchor text.)

The purpose of the JavaScript crap was to mask the anchor text, fool CSS that highlights nofollow’ed links (to avoid clean links to the test URI during the test), and to separate requests from crawlers and humans with different URIs.

Google crawls URIs extracted from somewhat sneaky JavaScript code

Less than half an hour later Googlebot requested the ?nil=js1 URI from the JavaScript code and totally ignored the hard coded URI in the A element’s HREF:

    66.249.72.5 2008-02-22 06:47:07 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js1

Roughly three hours after this visit Googlebot fetched a URI provided only in JS code on the test page:

    handle=document.getElementById('a1');
    handle.href='http://sebastians-pamphlets.com/repstuff/something.php?nil=js2';
    handle.rel='dofollow';

From the log:

    66.249.72.5 2008-02-22 09:37:11 200-OK Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /repstuff/something.php?nil=js2
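
If you want to repeat such a test, you have to pull the crawler requests out of your logs. Here is a small sketch that parses lines in the layout quoted above. The layout itself (IP, date, time, status, user agent, requested path) is an assumption derived from those two log lines, and the function names are made up.

```javascript
// Sketch: parse access log lines in the layout quoted in this post, assumed
// to be: IP, date, time, status, user agent, requested path.
function parseLogLine(line) {
  const match =
    /^(\S+) (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) (.+) (\/\S*)$/.exec(
      line.trim()
    );
  if (!match) return null;
  const [, ip, date, time, status, userAgent, path] = match;
  return { ip, date, time, status, userAgent, path };
}

// List the paths requested by user agents that claim to be Googlebot.
// The user agent string alone proves nothing, because it can be faked.
function googlebotPaths(lines) {
  return lines
    .map(parseLogLine)
    .filter((entry) => entry !== null && entry.userAgent.includes("Googlebot"))
    .map((entry) => entry.path);
}
```

Because the test serves a different query string to each link variant, the list of requested paths tells you directly which variant the crawler picked up.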

So far Google ignored the hidden JavaScript link to /repstuff/something.php?nil=js3 on the test page. Its code doesn’t change a static link, so that makes sense in the context of repeated statements like “Google ignores JavaScript links / treats them like nofollow’ed links” by Google reps.

Of course the JS code above is easy to analyze, but don’t think that you can fool Google with concatenated strings, external JS files or encoded JavaScript statements!

Google indexes pages that have only JavaScript links pointing to them

The next day I’ve checked the search index, and the results are interesting:

[Screenshot: rel-nofollow-test search results]

The first search result is the content of the URI with the query string parameter ?nil=js1, which is outputted with a JavaScript statement on my sidebar, masking the hard coded URI /repstuff/something.php without query string. There’s not a single real link to this URI elsewhere.

The second search result is a post URI where Google recognized the hard coded anchor text “zilchish crap”, but not the JS code that overwrites it with “Nillified, Nil”. With the SERP-URI parameter “&filter=0” Google shows more posts that are findable with the search term [zilchish]. (Hey Matt and Brian, here’s room for improvement!)

Google doesn’t pass anchor text of nofollow’ed links to the link destination

A search for [zilchish site:sebastians-pamphlets.com] doesn’t show the test page that doesn’t carry this term. In other words, so far the anchor text “zilchish crap” of the nofollow’ed sidebar link hasn’t impacted the test page’s rankings.

Google doesn’t treat anchor text of JavaScript links as textual content

A search for [nillified site:sebastians-pamphlets.com] doesn’t show any URIs that have “nil, nillified” as client sided anchor text on the sidebar, just the test page:

[Screenshot: rel-nofollow-test search results]

Results, conclusions, speculation

This test wasn’t intended to evaluate whether JS outputted anchor text gets passed to the link destination or not. Unfortunately “nil” and “nillified” appear both in the JS anchor text as well as on the page, so that’s for another post. However, it seems the JS anchor text isn’t indexed for the pages carrying the JS code, at least they don’t appear in search results for the JS anchor text, so most likely it will not be assigned to the link destination’s relevancy for “nil” or “nillified” as well.

Maybe Google’s algos dealing with client sided outputs need more than 24 hours to assign JS anchor text to link destinations; time will tell if nobody ruins my experiment with links, and that includes unavoidable scraping and its sometimes undetectable links that Google knows but never shows.

However, Google can assign static anchor text pretty fast (within less than 24 hours after link discovery), so I’m quite confident that condomized links still don’t pass reputation or topical relevance. My test page is unfindable for the nofollow’ed [zilchish crap]. If that changes later on, that will be the result of other factors, for example scraped pages that link without condom.

How to safely strip a link condom

And what’s the actual “news”? Well, say you’ve links that you must condomize because they’re paid or whatever, but you want Google to discover the link destinations nevertheless. To accomplish that, just output a nofollow’ed link server sided, and change it to a clean link with JavaScript. Google told us for ages that JS links don’t count, so that’s perfectly in line with Google’s guidelines. And if you keep your anchor text as well as URI, title text and such identical, you don’t cloak with deceitful intent. Other search engines might even pass reputation and relevance based on the client sided version of the link. Isn’t that neat?

Link condoms with juicy taste faking good karma

Of course you can use the JS trick without SEO in mind too. E.g. to prettify your condomized ads and paid links. If a visitor uses CSS to highlight nofollow, they look plain ugly otherwise.

Here is how you can do this for a complete Web page. This link is nofollow’ed. The JavaScript code below changed its REL value to “dofollow”. When you put this code at the bottom of your pages, it will un-condomize all your nofollow’ed links.

    <script type="text/javascript">
    if (document.getElementsByTagName) {
      var aElements = document.getElementsByTagName("a");
      for (var i = 0; i < aElements.length; i++) {
        var relvalue = aElements[i].rel.toUpperCase();
        // indexOf() returns -1 when the REL value doesn't contain NOFOLLOW,
        // so links without a condom are left alone
        if (relvalue.indexOf("NOFOLLOW") !== -1) {
          aElements[i].rel = "dofollow";
        }
      }
    }
    </script>

(You’ll still find condomized links on this page. That’s because the JavaScript routine above changes only links placed above it.)

When you add JavaScript routines like that to your pages, you’ll increase their page loading time. IOW you slow them down. Also, you should add a note to your linking policy to avoid confused advertisers who chase toolbar PageRank.

Updates: Obviously Google distrusts me, how come? Four days after the link discovery the search quality archangel requested the nofollow’ed URI -without query string- possibly to check whether I serve different stuff to bots and people. As if I’d cloak, laughable. (Or an assclown linked the URI without condom.)
Day five: Google’s crawler requested the URI from the totally hidden JavaScript link at the bottom of the test page. Did I hear Google reps stating quite often they aren’t interested in client-sided links at all?

19 comments Sebastian | Paid Links, Testing, Anchor Text, Cloaking, Google, SEO, Nofollow

Update your crawler detection: MSN/Live Search announces msnbot/1.1

Posted on 12 February, 2008

Fabrice Canel from Live Search announced significant improvements to their crawler today. The very much appreciated changes are:

HTTP compression

The revised msnbot supports gzip and deflate as defined by RFC 2616 (sections 14.11 and 14.39). Microsoft also provides a tool to check your server’s compression / conditional GET support. (Bear in mind that most dynamic pages (blogs, forums, …) will fool such tools, try it with a static page or use your robots.txt.)
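
On the server side, compression support boils down to looking at the request’s Accept-Encoding header and picking an encoding the client understands. The following is only a sketch with a made-up function name; it ignores q-values except for q=0, and in practice your Web server or a module does this for you.

```javascript
// Sketch: pick a response encoding from a client's Accept-Encoding header.
// Simplified on purpose: q-values other than q=0 are ignored.
function chooseEncoding(acceptEncoding) {
  const accepted = (acceptEncoding || "")
    .split(",")
    .map((part) => part.trim().toLowerCase())
    // drop codings the client explicitly refuses (q=0)
    .filter((part) => part !== "" && !/;\s*q=0(\.0*)?$/.test(part))
    .map((part) => part.split(";")[0].trim());
  if (accepted.includes("gzip")) return "gzip";
  if (accepted.includes("deflate")) return "deflate";
  return "identity"; // send the response uncompressed
}
```

A crawler sending “Accept-Encoding: gzip, deflate” would get gzip here, while a request without the header gets the uncompressed response.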

No more crawling of unchanged contents

The new msnbot/1.1 will not fetch pages that didn’t change since the last request, as long as the Web server supports the “If-Modified-Since” header in conditional GET requests. If a page didn’t change since the last crawl, the server responds with 304 and the crawler moves on. In this case your Web server exchanges only a handful of short lines of text with the crawler, not the contents of the requested resource.
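
For dynamic pages that don’t get conditional GET handling for free, the decision can be sketched like this. It’s a minimal example under the assumption that your script knows when the content was last modified; the function name is hypothetical, and a real implementation would also send a Last-Modified header with every 200 response.

```javascript
// Sketch: decide between "200 with body" and "304 without body" for a
// conditional GET. lastModified is a Date, the header value is a string.
function conditionalGetStatus(ifModifiedSinceHeader, lastModified) {
  if (!ifModifiedSinceHeader) return 200; // plain GET, send everything
  const since = Date.parse(ifModifiedSinceHeader);
  if (Number.isNaN(since)) return 200; // unparsable date: ignore the header
  // HTTP dates have one-second resolution, so drop the milliseconds.
  const modifiedSeconds = Math.floor(lastModified.getTime() / 1000);
  return modifiedSeconds <= Math.floor(since / 1000) ? 304 : 200;
}
```

When the function returns 304, the script sends just the status line and headers and skips generating the page body.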

If your server isn’t configured for HTTP compression and conditional GETs, you really should request that at your hosting service for the sake of your bandwidth bills.

New user agent name

From reading server log files we know the Live Search bot as “msnbot/1.0 (+http://search.msn.com/msnbot.htm)”, or “msnbot-media/1.0”, “msnbot-products/1.0”, and “msnbot-news/1.0”. From now on you’ll see “msnbot/1.1”. Nathan Buggia from Live Search clarifies: “This update does not apply to all the other ‘msnbot-*’ crawlers, just the main msnbot. We will be updating those bots in the future”.

If you just check the user agent string for “msnbot” you’ve nothing to change, otherwise you should check the user agent string for both “msnbot/1.0” as well as “msnbot/1.1” before you do the reverse DNS lookup to identify bogus bots. MSN will not change the host name “.search.live.com” used by the crawling engine.
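
The two-step check could look like the sketch below. The reverse DNS lookup itself is left out; the function expects the host name that your lookup returned (or null when there was none). Function names are made up for this example, and a thorough check would also resolve that host name forward again and compare it with the requesting IP.

```javascript
// Sketch: two-step check for requests claiming to come from the main msnbot.
// Pass in the host name returned by your own reverse DNS lookup (or null).
function claimsToBeMsnbot(userAgent) {
  // matches "msnbot/1.0" and "msnbot/1.1", but not "msnbot-media/1.0" etc.
  return /msnbot\/1\.[01]\b/i.test(userAgent);
}

function isGenuineMsnbot(userAgent, reverseDnsHostName) {
  if (!claimsToBeMsnbot(userAgent)) return false;
  if (!reverseDnsHostName) return false; // no host name: treat as bogus
  return reverseDnsHostName.toLowerCase().endsWith(".search.live.com");
}
```

Checking the user agent first keeps the comparatively expensive DNS lookup away from ordinary visitors.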

The announcement didn’t tell us whether the new bot will utilize HTTP/1.1 or not (MS and Yahoo crawlers, like other Web robots, still perform, or rather fake, HTTP/1.0 requests).

It looks like it’s no longer necessary to charge Live Search for bandwidth their crawler has burned. Jokes aside, instead of reporting crawler issues to [email protected], you can post your questions or concerns at a forum dedicated to MSN crawler feedback and discussions.

I’m quite nosy, so I just had to investigate what “there are many more improvements” in the blog post meant. I asked Nathan Buggia from Microsoft a few questions.

Nate, thanks for the opportunity to talk crawling with you. Can you please reveal a few msnbot/1.1 secrets?

I’m glad you’re interested in our update, but we’re not yet ready to provide more details about additional improvements. However, there are several more that we’ll be shipping in the next couple months.

Fair enough. So let’s talk about related topics.

Currently I can set crawler directives for file types identified by their extensions in my robots.txt’s msnbot section. Will you fully support wildcards (* and $ for all URI components, that is path and query string) in robots.txt in the foreseeable future?

This is one of several additional improvements that we are looking at today, however it has not been released in the current version of MSNBot. In this update we were squarely focused on reducing the burden of MSNBot on your site.

What can or should a Webmaster do when you seem to crawl a site way too fast, or not fast enough? Do you plan to provide a tool to reduce the server load, or to speed up your crawling for particular sites?

We currently support the “crawl-delay” option in the robots.txt file for webmasters that would like to slow down our crawling. We do not currently support an option to increase crawling frequency, but that is also a feature we are considering.

Will msnbot/1.1 extract URLs from client-side scripts for discovery crawling? If so, will such links pass reputation?

Currently we do not extract URLs from client-side scripts.

Google’s last change of their infrastructure made nofollow’ed links completely worthless, because they no longer used those in their discovery crawling. Did you change your handling of links with a “nofollow” value in the REL attribute with this upgrade too?

No, changes to how we process nofollow links were not part of this update.

Nate, many thanks for your time and your interesting answers!
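For readers who haven’t used the “crawl-delay” option Nate mentions: it goes into the crawler’s section of robots.txt. The sketch below feeds a made-up robots.txt to the parser from Python’s standard library to show how a well-behaved crawler could read it. The delay of 10 seconds and the /private/ path are arbitrary examples, and how msnbot itself interprets the value is not covered by the interview.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a dedicated msnbot section.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The parser matches on the product token in front of the slash.
delay = parser.crawl_delay("msnbot/1.1")
may_fetch_private = parser.can_fetch("msnbot/1.1", "http://example.com/private/page.html")
may_fetch_public = parser.can_fetch("msnbot/1.1", "http://example.com/public/")

print(delay, may_fetch_private, may_fetch_public)
```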

Related posts:

Official announcement - by Nathan Buggia, Live Search Webmaster Center Blog
MSNbot 1.1: Live Search Implements A More Efficient Crawl - by Vanessa Fox, Search Engine Land


7 comments Sebastian | Analytics, MSN, Crawler Directives, Cloaking, robots.txt, SEO

Why storing URLs with truncated trailing slashes is an utterly idiocy

Posted on 6 February, 2008

With some Web services URL canonicalization has a downside. What works great for major search engines like Google can backfire when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog’s reputation at Technorati.

In fact the problem is not your URL canonicalization, e.g. 301 redirects from http://example.com to http://example.com/ and from http://example.com/directory to http://example.com/directory/, but crappy software that removes trailing forward slashes from your URLs.

Dear Web developers, if you really think that home page locations or directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “http://example.com” as anchor text makes any sense when the URL in HREF points to “http://example.com/”). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes. Many usages of invalid URLs can damage the search engine rankings of the link destinations. You can’t control that, hence don’t screw our URLs. Never. Period.

Folks who don’t agree with the above, read on.

TOC:

What is a trailing slash? - About URLs, directory URIs, default documents, directory indexes, …
How to rescue stolen trailing slashes - About Apache’s handling of directory requests, and rewriting or redirecting invalid directory URIs in .htaccess as well as in PHP scripts.
Why stealing trailing slashes is not cool - Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.
How URL canonicalization irritates Technorati - 301 redirects that “add” a trailing slash to directory URLs, or to virtual URIs that mimic directories, seem to irritate Technorati so much that it can’t compute reputation, recent post lists, and so on.

What is a trailing slash?

The Web’s standards say (links and full quotes): The trailing path segment delimiter “/” represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite “should” as “must”.)

To understand that, let’s look at the most common URL components:
scheme:// server-name.tld /path ?query-string #fragment
The path part begins with a forward slash “/” and must consist of at least one byte (the trailing slash itself in case of the home page URL http://example.com/).

If a URL ends with a slash, it points to a directory’s default document, or, if there’s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because “/” after the TLD (.com|net|org|…) stands for the root directory.
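If you want to see the components for yourself, Python’s standard URL parser splits a URL exactly along the lines of the scheme above. This is just an illustration with example.com URLs; the helper name is made up.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return the path, query string and fragment of a URL as a tuple."""
    parts = urlsplit(url)
    return parts.path, parts.query, parts.fragment


for url in (
    "http://example.com",
    "http://example.com/",
    "http://example.com/directory/?id=1#top",
):
    print(url, "->", describe(url))
```

Note that the first URL comes back with an empty path, while the second one has the one-byte path “/” that addresses the root directory.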

Automated directory indexes (a list of links to all files) should be forbidden; use Options -Indexes in .htaccess to send such requests to your 403-Forbidden page.

In order to set default file names and their search sequence for your directories use DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php. In this example: on request of http://example.com/directory/ Apache will first look for /directory/index.html, then if that doesn’t exist for /directory/index.htm, then /directory/index.php, and if all that fails, it will serve an error page (that should log such requests so that the Webmaster can upload the missing default document to /directory/).
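The search sequence described above boils down to “first match wins”. Here is a toy Python model of it, not Apache’s actual implementation: the function name is invented, and a plain set of paths stands in for the file system.

```python
def resolve_directory_index(directory, existing_files, candidates, fallback):
    """Return the first default document that exists in `directory`.

    directory: URL path ending with a slash, e.g. "/directory/".
    existing_files: set of paths present on the (simulated) server.
    candidates: default file names in the order given to DirectoryIndex.
    fallback: absolute path served when no candidate exists.
    """
    for name in candidates:
        path = directory + name
        if path in existing_files:
            return path  # first match wins, later candidates are ignored
    return fallback


CANDIDATES = ["index.html", "index.htm", "index.php"]
FALLBACK = "/error_handler/missing_directory_index_doc.php"
files = {"/directory/index.php", "/directory/index.htm", "/contact/form.php"}

print(resolve_directory_index("/directory/", files, CANDIDATES, FALLBACK))
print(resolve_directory_index("/contact/", files, CANDIDATES, FALLBACK))
```

The order of the candidates matters, which is why swapping index.html and index.php in the directive is enough to switch a site from static to dynamic default documents.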

The URL http://example.com (without the trailing slash) is invalid, and there’s no specification giving a reason why a Web server should respond to it with meaningful contents. Actually, the location http://example.com points to Null (nil, zilch, nada, zip, nothing), hence the correct response is “404 - we haven’t got ‘nothing to serve’ yet”.

The same goes for sub-directories. If there’s no file named “/dir”, the URL http://example.com/dir points to Null too. If you’ve a directory named “/dir”, the canonical URL http://example.com/dir/ either points to a directory index page (an autogenerated list of all files) or the directory’s default document “index.(html|htm|shtml|php|…)”. A request of http://example.com/dir -without the trailing slash that tells the Web server that the request is for a directory’s index- resolves to “not found”.

You must not reference a default document by its name! If you’ve links like http://example.com/index.html you can’t change the underlying technology without serious hassles. Say you’ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you’ll realize that static stuff sucks, hence you’ll develop a dynamic site with PHP. You’ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you’ve coded your internal links as http://example.com/contact/ etc. they’ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from “… index.html … index.php …” to “… index.php … index.html …”. (Of course you can configure Apache to parse .html files for PHP code, but that’s another story.)

It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of “index” files (intentionally or not). I’d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes.

How to rescue stolen trailing slashes

Since Web servers know that users are faulty by design, they jump through a couple of resource burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values …) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar.

With Apache, DirectorySlash On enables this behavior (check whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules:

RewriteEngine on
RewriteBase /content/
RewriteRule ^dir1$ http://example.com/content/dir1/ [R=301,L]
RewriteRule ^dir2$ http://example.com/content/dir2/ [R=301,L]

With content management systems (CMS) that generate virtual URLs on the fly, often there’s no other chance than hacking the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, you’ll perform permanent redirects (301).

Here is a WordPress (header.php) example:

$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$doRedirect = FALSE;
$fileExtensions = array(".html", ".htm", ".php");
$serverName = $_SERVER["SERVER_NAME"];
$canonicalServerName = $serverName;
// if you prefer http://example.com/* URLs remove the "www.":
$srvArr = explode(".", $serverName);
$canonicalServerName = $srvArr[count($srvArr) - 2] ."." .$srvArr[count($srvArr) - 1];
$url = parse_url("http://" .$canonicalServerName .$requestUri);
$requestUriPath = $url["path"];
// a path without a trailing slash is either a file or an invalid directory request
if (substr($requestUriPath, -1, 1) != "/") {
    $isFile = FALSE;
    foreach ($fileExtensions as $fileExtension) {
        if (strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension)) {
            $isFile = TRUE;
        }
    }
    if (!$isFile) {
        $requestUriPath .= "/";
        $doRedirect = TRUE;
    }
}
$canonicalUrl = "http://" .$canonicalServerName .$requestUriPath;
if ($queryString) {
    $canonicalUrl .= "?" . $queryString;
}
// parse_url() sets "fragment" only if the URI contains one
if (!empty($url["fragment"])) {
    $canonicalUrl .= "#" . $url["fragment"];
}
if ($doRedirect) {
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit;
}

Check your permalink settings and edit the values of $fileExtensions and $canonicalServerName accordingly. For other CMSs adapt the code; perhaps you need to change the handling of query strings and fragments. The code above will not run under IIS, because it has no REQUEST_URI variable.
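If your CMS isn’t written in PHP, the same idea ports easily. The following Python sketch covers only the trailing slash decision, not the server name handling from the example above; the function name and the list of extensions are assumptions you’d adapt to your own permalink structure.

```python
from urllib.parse import urlsplit, urlunsplit

# Paths ending with one of these are treated as files, not directories.
FILE_EXTENSIONS = (".html", ".htm", ".php")


def canonical_redirect_target(url, file_extensions=FILE_EXTENSIONS):
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/") or path.lower().endswith(file_extensions):
        return None  # directory URL with slash, or a file: leave it alone
    # Missing trailing slash: append it, keep query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, path + "/", parts.query, parts.fragment))


for url in (
    "http://example.com",
    "http://example.com/about?ref=feed",
    "http://example.com/about/",
    "http://example.com/page.html",
):
    print(url, "->", canonical_redirect_target(url))
```

Whatever the function returns should be sent with a 301 status, for the reasons given above.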

Why stealing trailing slashes is not cool

This section expressed in one sentence: Cool URLs don’t change, hence changing other people’s URLs is not cool.

Folks should understand the “U” in URL as unique. Each URL addresses one and only one particular resource. Technically speaking, if you change one single character of a URL, the altered URL points to a different resource, or nowhere.

Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.

Only a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren’t much different. If somebody hands out a URL (with trailing slash), then use it as is, or don’t use it at all. Don’t “prettify” it, because any change destroys its serviceability.

If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (browser, screen reader, bot) with a redirect header telling it that one must use a trailing slash; then the user agent has to re-issue the request in the formally correct way. From a Webmaster’s perspective, burning resources that thoughtlessly is plain theft. From a user’s perspective, things will often work without the slash, but they’ll be quicker with it. “Often” doesn’t equal “always”:

Some Web servers will serve the 404 page.
Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.
Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.
Some Web servers will serve identical contents for /dir and /dir/, which leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to all known URL variants, instead of accumulated rankings to both URLs (which would be the right thing to do, but is technically, well, challenging).
Some user agents can’t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect’s “error message”, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang or process garbage.

Does it really make sense to maliciously manipulate URLs just because some clueless developers say “dude, without the slash it looks way cooler”? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach.

KISS (keep it simple, stupid) is a great principle. “Cosmetic corrections” like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don’t change their inputs. As long as the input is processible, they accept it, otherwise they skip it.

Exceptions

URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - “domain.co.uk” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than a visitor who doesn’t type in the canonical URL including scheme, www-prefix, and trailing slash.

How URL canonicalization seems to irritate Technorati

Because Technorati’s user support is not exactly responsive (or rather, swamped), parts of this section should be interpreted as educated speculation. Also, I didn’t research enough cases to come to a working theory. So here is just the story “how Technorati fails to deal with my blog”.

When I moved my blog from blogspot to this domain, I enhanced the faulty WordPress URL canonicalization. If any user agent requests http://sebastians-pamphlets.com it gets redirected to http://sebastians-pamphlets.com/. Invalid post/page URLs like http://sebastians-pamphlets.com/about redirect to http://sebastians-pamphlets.com/about/. All redirects are permanent, returning the HTTP response code “301”.

I’ve claimed my blog as http://sebastians-pamphlets.com/, but Technorati shows its URL without the trailing slash. …<div class="url"><a href="http://sebastians-pamphlets.com">http://sebastians-pamphlets.com</a> </div> <a class="image-link" href="/blogs/sebastians-pamphlets.com"><img …

By the way, they forgot dozens of fans (folks who “fave’d” either my old blogspot outlet or this site) too.
(Screenshot: Blogs claimed at Technorati)

I’ve added a description and tons of tags, neither of which shows up on public pages. It seems my tags were deleted, at least they aren’t visible in edit mode any more.
(Screenshot: Edit blog settings at Technorati)

Shortly after the submission, Technorati stopped adjusting the reputation score for newly discovered inbound links. Furthermore, the list of my recent posts became stale, although I’ve pinged Technorati with every update, and Technorati received my update notifications via ping services too. And yes, I’ve tried manual pings to no avail.

I’ve gained lots of fresh inbound links, but the authority score didn’t change. So I asked Technorati’s support for help. A few weeks later, in December/2007, I got an answer:

I’ve taken a look at the issue regarding picking up your pings for “sebastians-pamphlets.com”. After making a small adjustment, I’ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.

Please let us know if you experience any problems in the future. Do not hesitate to contact us if you have any other questions.

Indeed, Technorati updated the reputation score from “56” to “191”, and refreshed the list of posts including the most recent one.

Of course the “small adjustment” didn’t persist (I assume that a batch process stole the trailing slash that the friendly support person had added). I’ve sent a follow-up email asking whether that’s a slash issue or not, but haven’t received a reply yet. I’m quite sure that Technorati doesn’t follow 301-redirects, so that’s at least a plausible cause for this bug.

Since December 2007 Technorati hasn’t updated my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the reactions page - by the way these numbers are often unreal and change in the range of hundreds from day to day).
(Screenshot: Blog reactions and authority scoring at Technorati)

It seems Technorati hasn’t indexed my posts since then (December/18/2007), so probably my outgoing links don’t count for their destinations.
(Screenshot: Stale list of recent posts at Technorati)

(All screenshots were taken on February/05/2008. When you click the Technorati links today, it ~~could~~ hopefully will look different.)

I’m not amused. I’m curious what would happen if I added if (!preg_match("/Technorati/i", "$userAgent")) {/* redirect code */}
to my canonicalization routine, but I can resist special-casing particular Web robots. My URL canonicalization should be identical for both visitors and crawlers. Technorati should be able to fix this bug without code changes at my end or weekly support requests. Wishful thinking? Maybe.

Update 2008-03-06: Technorati crawls my blog again. The 301 redirects weren’t the issue. I’ll explain that in a follow-up post soon.


52 comments Sebastian | Blogging, Usability, Technorati, Duplicate Content, Web development, .htaccess, Anchor Text, Crap, SEO

Sorry Aaron Wall - I fucked up

Posted on 29 January, 2008

My somewhat sarcastic post “Avoiding the well known #4 penalty”, where I joked about a possible Google #6 filter and criticized the SEO/Webmaster community for invalid methods of dealing with SERP anomalies, reads like “Aaron Wall is a clueless douche-bag”. Of course that’s not true, I never thought that, and I apologize for damaging Aaron’s reputation so thoughtlessly.

To express that I believe Aaron is a smart and very nice guy, I link to his related great post about things SEOs can learn from search engine bugs and glitches:

Do You Care About Google Glitches? Excerpt:

Glitches reveal engineer intent. And they do it early enough that you have time to change your strategy before your site is permanently filtered or banned. When you get to Google’s size, market share, and have that much data, glitches usually mean something.

To make my point clear: calling a SERP anomaly a filter or penalty before its intents and causes are properly analyzed, and before this analysis is backed up with a reasonable data set, is as thoughtless as damaging a fellow SEO’s reputation in a way that makes someone new to the field, reading my post and/or comments at Sphinn, think that I’m poking Aaron, although I’m just sick of the almost daily WMW penalty inventions (WMW members -not Aaron!- invented the “Google position #6 penalty / filter” term). The sole reason for mentioning Aaron in my post was that his post (also read this one) triggered a great discussion at Sphinn that I’ve cited in parts.


4 comments Sebastian | Folks, Crap, SEO

Google removes the #6 penalty/filter/glitch

Posted on 29 January, 2008

After the great #6 Penalty SEO Panel, Matt Cutts, head of Google’s webspam dept., dug out a misbehaving algo and sent it back to the developers. Two hours ago he stated:

When Barry asked me about “position 6” in late December, I said that I didn’t know of anything that would cause that. But about a week or so after that, my attention was brought to something that could exhibit that behavior.

We’re in the process of changing the behavior; I think the change is live at some datacenters already and will be live at most data centers in the next few weeks.

So everything is fine now. Matt penalizes the position-six software glitch, and lost top positions will revert to their former rankings in a while. Well, not really. Nobody will compensate income losses, nor the time Webmasters spent on forums discussing a suspected penalty that actually was a bug or a weird side effect. However, kudos to Google for listening to concerns, tracking down and fixing the algo. And thanks for the update, Matt.


4 comments Sebastian | Search Quality, SEO, Google

Avoiding the well known #4 SERP-hero-penalty …

Posted on 25 January, 2008

Seb the red claw … I just have to link to North South Media’s neat collection of Search Action Figures.

Paul pretty much dislikes folks who don’t link to him, so Danny Sullivan and Rand Fishkin are well advised to drop a link every now and then, and David Naylor had better give him an interview slot asap.

Google’s numbered “penalties”, esp. #6

As for numeric penalties in general … repeat("Sigh", ∞) … enjoy this brains trust moderated by Marty Weintraub (unauthorized):

Marty: Folks, please welcome Aaron Wall, who recently got his #6 penalty removed!

Audience: clap(26) sphinn(26)

The Gypsy: Sorry Marty but come on… this is complete BS and there is NO freakin #6 filter just like the magical minus 90…900 bla bla bla. These anomalies NEVER have any real consensus on a large enough data set to even be considered a viable theory.

A Red Crab: As long as Bill can’t find a plus|minus-n-raise|penalty patent, or at least a white paper or so leaked out from Google, or for all I care a study that provides proof instead of weird assumptions based on claims of webmasters jumping on today’s popular WMW band wagon that are neither plausible nor verifiable, such beasts don’t exist. There are unexplained effects that might look like a pattern, but in most cases it makes no sense to gather a few examples coming with similarities because we’ll never reach the critical mass of anomalies to discuss a theory worth more than a thumbs-down click.

Marty: Maybe Aaron is joking. Maybe he thinks he has invented the next light bulb.

Gamermk: Aaron is grasping at straws on this one.

Barry Welford: I would like this topic to be seen by many.

Audience: clap(29) sphinn(29)

The Gypsy: It is just some people that have DECIDED on an end result and trying to make various hypothesis fit the situation (you know, like tobacco lobby scientists)… this is simply bad form IMO.

Danny Sullivan: Well, I’ve personally seen this weirdness. Pages that I absolutely thought “what on earth is that doing at six” rather than at the top of the page. Not four, not seven — six. It was freaking weird for several different searches. Nothing competitive, either.

I don’t know that sixth was actually some magic number. Personally, I’ve felt like there’s some glitch or problem with Google’s ranking that has prevented the most authoritative page in some instances from being at the top. But something was going on.

Remember, there’s no sandbox, either. We got that for months and months, until eventually it was acknowledged that there were a range of filters that might produce a “sandbox like” effect.

The biggest problem I find with these types of theories is they often start with a specific example, sometimes that can be replicated, then they become a catch-all. Not ranking. Oh, it’s the sandbox. Well no — not if you were an established site, it wasn’t. The sandbox was typically something that hit brand new sites. But it became a common excuse for anything, producing confusion.

Jim Boykin: I’ll jump in and say I truly believe in the 6 filter. I’ve seen it. I wouldn’t have believed it if I hadn’t seen it happen to a few sites.

Audience: clap(31) sphinn(31)

A Red Crab: Such terms tend to take on a life of their own, IOW become an excuse for nearly every way a Webmaster can fuck up rankings. Of course Google’s query engine has thresholds (yellow cards or whatever they call them) that don’t allow some sites to rank above a particular position, but that’s a symptom that doesn’t allow back-references to a particular cause, or causes. It’s speculation as long as we don’t know more.

IncrediBill: I definitely believe it’s some sort of filter or algo tweak but it’s certainly not a penalty which is why I scoff at calling it such. One morning you wake up and Matt has turned all the dials to the left and suddenly some criteria bumps you UP or DOWN. Sites have been going up and down in Google SERPs for years, nothing new or shocking about that and this too will have some obvious cause and effect that could probably be identified if people weren’t using the shotgun approach at changing their site

G1smd: By the time anyone works anything out with Google, they will already be in the process of moving the goalposts to another country.

Slightly Shady SEO: The #6 filter is a fallacy.

Old School: It certainly occurred but only affected certain sites.

Danny Sullivan: Perhaps it would have been better called a -5 penalty. Consider. Say Google for some reason sees a domain but decides good, but not sure if I trust it. Assign a -5 to it, and that might knock some things off the first page of results, right?

Look — it could all be coincidence, and it certainly might not necessarily be a penalty. But it was weird to see pages that for the life of me, I couldn’t understand why they wouldn’t be at 1, showing up at 6.

Slightly Shady SEO: That seems like a completely bizarre penalty. Not Google’s style. When they’ve penalized anything in the past, it hasn’t been a “well, I guess you can stay on the frontpage” penalty. It’s been a smackdown to prove a point.

Matt Cutts: Hmm. I’m not aware of anything that would exhibit that sort of behavior.

Audience: Ugh … oohhhh … you weren’t aware of the sandbox, either!

Danny Sullivan: Remember, there’s no sandbox, either. We got that for months and months, until eventually it was acknowledged that there were a range of filters that might produce a “sandbox like” effect.

Audience: Bah, humbug! We so want to believe in our lame excuses …

Tedster: I’m not happy with the current level of analysis, however, and definitely looking for more ideas.

Audience: clap(40) sphinn(40)

Of course the panel above is fictional, or rather assembled from snippets that in some cases carry a different message when you read them in their original context. So please follow the links.

I wouldn’t go so far as to say there’s no such thing as a fair number of Web pages that deserve a #1 spot on Google’s SERPs but rank #6 for unknown reasons (perhaps link monkey business, staleness, PageRank flow in disarray, anchor text repetitions, …). There’s something worth investigating.

However, I think that labelling a discussion of glitches, or maybe misbehaving filters, “#6 penalty” based on a way too tiny dataset leads to the lame-excuse-for-literally-anything phenomenon.

Folks who don’t follow the various threads closely enough to spot the highly speculative character of the beast will take it as fact and switch to winter sleep mode instead of enhancing their stuff like Aaron did. I can’t wait for the first “How to escape the Google -5 penalty” SEO tutorial telling the great unwashed that a “+5” revisit-after meta tag will heal it.


18 comments Sebastian | Ego Food, Folks, Fun, Crap, SEO, Google
