Archived posts from the 'Google' Category

Google and Yahoo accept undelayed meta refreshes as 301 redirects

Although the meta refresh often gets abused to trick visitors into popup hells by sneaky pages on low-life free hosts (poor man’s cloaking), search engines don’t treat every instance of the meta refresh as Webspam. Folks moving their free hosted stuff to their own domains rely on it to redirect to the new location:
<meta http-equiv=refresh content="0; url=" />

Yahoo clearly states how they treat a zero meta refresh, that is a redirect with a delay of zero seconds:

META Refresh: <meta http-equiv="refresh" content=…> is recognized as a 301 if it specifies little or no delay or as a 302 if it specifies noticeable delay.

Google is in the process of rewriting their documentation; the current version of their help documents does not (yet!) mention the meta refresh. The Google Mini treats all meta refreshes as 302:

A META tag that specifies http-equiv="refresh" is handled as a 302 redirect.

but that’s handled differently on the Web. I’ve asked Google’s search evangelist Adam Lasnik and he said:

[The] best idea is to use 301/302s directly whenever possible; otherwise, next best is to do a metarefresh with 0 for a 301. I don’t believe we recommend or support any 302-alternative.

Thanks Adam! I’ll update the last meta refresh thread.

If you have the chance to do 301 redirects don’t mess with the meta refresh. Utilize this method only when there’s absolutely no other chance.

Full stop for search geeks. What follows is an explanation for less experienced Webmasters who need to move their stuff away from greedy Web content funeral services, aka free hosts of any sort.

Ok, now that we know the major search engines accept an undelayed meta refresh as poor man’s 301 redirect, what should a page carrying this tag look like in order to act as a provisional permanent redirect? As plain and functional as possible:
<title>Moved to new URL:</title>
<meta http-equiv=refresh content="0; url=" />
<meta name="robots" content="noindex,follow" />
<h1>This page has been moved to</h1>
<p>If your browser doesn't redirect you to the new location please <a href=""><b>click here</b></a>, sorry for the hassles!</p>

As long as the server delivers the content above under the old URL sending a 200-OK, Google’s crawl stats should not list the URL under 404 errors. If it does appear under “Not found”, something went badly wrong, probably on the free host’s side. As long as you have control over the account, you must not delete the page, because the search engines revisit it from time to time to check whether that URL still redirects.

[Excursus: When a search engine crawler fetches this page, the server returns a 200-OK because, well, it’s there. Acting as a 301/302 does not make it a standard redirect. That sounds confusing to some people, so here is the technical explanation. Server sided response codes like 200, 302, 301, 404 or 410 are sent by the Web server to the user agent in the HTTP header before the server delivers any page content to the user agent (Web browser, search engine crawler, …). The meta refresh OTOH is a client sided directive telling the user agent to disregard the page’s content and to fetch the given (new) URL to render it instead of the initially requested URL. The browser parses the redirect directive out of the file which was received with a HTTP response code 200 (OK). That’s why you don’t get a 302 or 301 when you use a server header checker.]
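To make the excursus a bit more tangible, here is a minimal Python sketch (not part of the original post) of what a user agent has to do after it received the page with a 200: parse the body and dig the redirect directive out of the meta tag. The class name, the helper function and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remembers (delay_seconds, url) of the first meta refresh tag seen."""

    def __init__(self):
        super().__init__()
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        content = attr.get("content", "")
        if not content:
            return
        # content looks like "0; url=http://example.com/new"
        delay, _, rest = content.partition(";")
        url = rest.strip()
        if url.lower().startswith("url="):
            url = url[4:].strip()
        try:
            seconds = float(delay.strip() or 0)
        except ValueError:
            return
        self.refresh = (seconds, url)


def find_meta_refresh(html):
    """Return (delay_seconds, url) or None if the page has no meta refresh."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.refresh
```

The redirect only shows up after the body has been parsed, which is exactly why a server header checker, which looks at the HTTP status line only, reports a plain 200.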

When a search engine crawler fetches the page above, that’s just the beginning of a pretty complex process. Search engines are large-scale systems which make use of asynchronous communication between tons of highly specialized programs. The crawler itself has nothing to do with indexing. Maybe it follows server sided redirects instantly, but that’s unlikely with meta refreshes because crawlers just fetch Web contents for unprocessed delivery to a data pool from where all sorts of processes like (vertical) indexers pull their fodder. Deleting a redirecting page in the search index might be done by process A running hourly, whilst process B instructing the crawler to fetch the redirect’s destination runs once a day, then the crawler may be swamped so that it delivers the new content a month later to process C which ran just five minutes before the content delivery and starts again not before next Monday if that’s not a bank holiday…

That means the old page may get deindexed way before the new URL makes it into the search index. If you change anything during this period, you just confuse the pretty complex chain of processes, which means that perhaps the search engine starts over by rolling back all transactions and refetching the redirecting page. Not good. Keep all kinds of permanent redirects forever.

Actually, a zero meta refresh works like a 301 redirect because the engines (shall) treat it as a permanent redirect, but it’s not a native 301. In fact, due to so much abuse by spammers it might be considered less reliable than a server sided 301 sent in the HTTP header. Hence you want to express your intention clearly to the engines. You do that with several elements of the meta refresh’ing page:

  • The page title says that the resource was moved and tells the new location. Words like “moved” and “new URL” without surrounding gimmicks clear the message.
  • The zero (second) delay parameter shows that you don’t deliver visible content to (most) human visitors but switch their user agent right to the new URL.
  • The “noindex” robots meta tag telling the engines not to index the actual page’s contents is a signal that you don’t cheat. The “follow” value (referring to links in BODY) is just a fallback mechanism to ensure that engines having trouble understanding the redirect at least follow and index the “click here” link.
  • The lack of indexable content and keywords makes clear that you don’t try to achieve SE rankings for anything except the new URL.
  • The H1 heading repeating the title tag’s content on the page, visible for users surfing with meta refresh = off, accelerates the message and helps the engines to figure out the seriousness of your intent.
  • The same goes for the text message with a clear call for action underlined with the URL introduced by other elements.

Meta refreshes, like other client sided redirects (e.g. window.location = ""; in JavaScript), can be found in every spammer’s toolbox, so don’t leave the outdated content on the page, and add a JavaScript redirect only to contentless pages like the sample above. Actually, you don’t need to do that, because the number of users surfing with meta-refresh=off is only a tiny fraction of your visitors, and using JavaScript redirects is way more risky (WRT picky search engines) than a zero meta refresh. Also, JavaScript redirects, if captured by a search engine, should count as 302, and you really don’t want to deal with all the disadvantages of soft redirects.

Another interesting question is whether removing the content from the outdated page makes a difference or not. Doing a mass search+replace to insert the meta tags (refresh and robots) with no further changes to the HTML source might seem attractive from a Webmaster’s perspective. It’s error-prone however. Creating a list mapping outdated pages to their new locations to feed a quick+dirty desktop program generating the simple HTML code above is actually easier and eliminates a couple of points of failure.
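Such a quick+dirty generator could look roughly like the following Python sketch. It is an illustration under assumptions, not the tool used for this post: the function name, the file names and the example.com URLs are hypothetical, and putting the new URL into the title, the heading and the link follows the description of the page elements given above.

```python
from html import escape

# Same minimal page as the sample above, with a placeholder for the new URL.
TEMPLATE = """<title>Moved to new URL: {url}</title>
<meta http-equiv=refresh content="0; url={url}" />
<meta name="robots" content="noindex,follow" />
<h1>This page has been moved to {url}</h1>
<p>If your browser doesn't redirect you to the new location please <a href="{url}"><b>click here</b></a>, sorry for the hassles!</p>
"""


def build_redirect_pages(mapping):
    """Return {old_filename: html} for a dict of old filename -> new URL."""
    pages = {}
    for old_name, new_url in mapping.items():
        # Escape &, <, > and quotes so the URL is safe inside attributes.
        pages[old_name] = TEMPLATE.format(url=escape(new_url, quote=True))
    return pages
```

Writing the result to disk is then a short loop over `build_redirect_pages(mapping).items()`, after which the generated files get uploaded over the outdated pages on the free host.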

Finally: Make use of meta refreshes on free hosts only. Professional hosting firms let you do server sided redirects!


Google is neat

August/16/2007: I’ve installed WordPress 8 days ago on this brand new domain.
August/17/2007: I’ve submitted an XML sitemap to Google.
August/18/2007: I’ve (somewhat hidden) linked to this domain from my old blog.
August/23/2007: Ms. Googlebot has crawled 749 pages from this blog, 9 pages made it in the Web search index so far.
August/24/2007: I got the very first hit from a Google SERP for [Google is neat]:
[Screenshot: Google is neat - my first SERP referrer]
Considering the number of results for this search term I think my #4 spot is not too bad, although it’s purely based on BlitzIndexing and certainly not to stay for long. The same post from my old blog ranks #22 for this search, probably caused by its link (via a 301 redirect script) to the new URL.

Interestingly the search query URL in my referrer stats is too clean; it lacks all the gimmicks Google adds when one searches with a browser. So who did that to alert me to the indexing? Thanks for choosing such a neat search term! :)


Google’s 5 sure-fire steps to safer indexing

Are you wondering why Gray Hat Search Engine News (GHN) is so quiet recently?

One reason may be that I’ve borrowed their Google savvy spy. I’ve sent him to Mountain View again to learn more about Google’s nofollow strategy.

He returned with a copy of Google’s recently revised mission statement, discovered in the wastebasket of a conference room near office 211 in building 43. Read the shocking and unbelievable head note printed in bold letters:

Google’s mission is to condomize the world’s information and make it universally uncrawlable and useless.

Read and reread it, then some weird facts begin to make sense. Now you’ll understand why:

  1. The rel-nofollow plague was designed to maximize collateral damage by devaluing all hyperlinked votes by honest users of nearly all platforms you’re using everyday, for example Twitter, Wikipedia, corporate blogs, GoogleGroups … ostensibly to nullify the efforts of a few spammers.
  2. Nobody bothers to comment on your nofollow’ed blog.
  3. Google invented the supplemental index (to store scraped resources suffering from too many condomized links) and why it grows faster than the main index.
  4. Google installed the Bigdaddy infrastructure (to prevent Ms. Googlebot from following nofollow’ed links).
  5. Google switched to BlitzCrawling (to list timely contents for a moment whilst fat resources from large archives get buried in the supplemental index). RIP deep crawler and freshbot.

Seriously, the deep crawler isn’t defunct, it’s called supplemental crawler nowadays, and the freshbot is still alive as Feedfetcher.

Disclaimer: All these hard facts were gathered by torturing sources close to Google, robbery and other unfair methods. If anyone bothers to debunk all that as a bad joke, one question still remains: Why does Google do next to nothing to stop the nofollow plague? I mean, ongoing mass abuse of rel-nofollow is obviously counterproductive with regard to their real mission.


Google manifested the axe on reciprocal link exchanges

Yesterday Fantomaster via Threadwatcher pointed me to this page of Google’s Webmaster help system. The cache was a few days old and didn’t show a difference, and I don’t archive each and every change of the guidelines, so I asked, and a friendly and helpful Googler told me that this item had been around for a while. Today this page made it on Sphinn and probably a few other Webmaster hangouts too.

So what the heck is the scandal all about? When you ask Google for help on “link exchange“, the help machine rattles for a second, sighs, coughs, clears its throat and then yells out the answer in bold letters: “Link schemes“, bah!

Ok, we already knew what Google thinks about artificial linkage: “Don’t participate in link schemes designed to increase your site’s ranking or PageRank”. Honestly, what is the intent when I suggest that you link to me and concurrently I link to you? Yup, it means I boost your PageRank and you boost mine, also we chose some nice anchor text and that makes the link deal perfect. In the eyes of Google even such a tiny deal is a link scheme, because both links weren’t put up for users but for search engines.

Pre-Google this kind of link deal was business as usual and considered natural, but frankly back then the links were exchanged for traffic and not for search engine love. We can rant and argue as much as we want, that will not revert the changed character of link swaps nor Google’s take on manipulative links.

Consequently Google has devalued artificial reciprocal links for ages. Much simplified, these links nullify each other in Google’s search index. That goes for tiny sins. Folks who scaled the concept up to larger link networks got caught too, but they were penalized or even banned for link farming.

Obviously all kinds of link swaps are easy to detect algorithmically, even triangular link deals, three way link exchanges and whatnot. I’ve called those plain vanilla link ‘swindles’. Only just recently, though, has Google caught up with a scalable solution that seems to detect and penalize most if not all variants across the whole search index, thanks to the search quality folks in Dublin and Zurich even overseas and in whatever language.

The knowledge that the days of free link trading are numbered was out for years before the exodus. Artificial reciprocal links as well as other linkage considered link spam by Google was and is a pet peeve of Matt’s team. Google sent lots of warnings, and many sane SEOs and Webmasters heard their traffic master’s voice and acted accordingly. Successful link trading just went underground leaving the great unwashed alone with their obsession about exchanging reciprocal links in the public.

Also old news is, that Google does not penalize reciprocal links in general. Google almost never penalizes a pattern or a technique. Instead they try to figure out the Webmaster’s intent and judge case by case based on their findings. And yes, that’s doable with algos, perhaps sometimes with a little help from humans to compile the seed, but we don’t know how perfect the algo is when it comes to evaluations of intent. Natural reciprocal links are perfectly fine with Google. That applies to well maintained blogrolls too, despite the often reciprocal character of these links. Reading the link schemes page completely should make that clear.

Google defines link scheme as “[…] Link exchange and reciprocal links schemes (‘Link to me and I’ll link to you.’) […]”. The “I link to you and vice versa” part literally addresses link trading of any kind, not a situation where I link to your compelling contents because I like a particular page, and you return the favour later on because you find my stuff somewhat useful. As Perkiset puts it “linking is now supposed to be like that well known sex act, ‘68’ - or, you do me and I’ll owe you one” and there is truth in this analogy. Sometimes a favor will not be returned. That’s the way the cookie crumbles when you’re keen on Google traffic.

The fact that Google openly said that link exchange schemes designed “exclusively for the sake of cross-linking” of any kind violate their guidelines indicates that first they were sure to have invented the catchall algo, and second that they felt safe to launch it without too much collateral damage. Not everybody agrees. I quote Fantomaster’s critique, and not only because I like his inimitable parlance:

This is essentially a theological debate: Attempting to determine any given action’s (and by inference: actor’s) “intention” (as in “sinning”) is always bound to open a can of worms or two.

It will always have to work by conjecture, however plausible, which makes it a fundamentally tacky, unreliable and arbitrary process.

The delusion that such a task, error prone as it is even when you set the most intelligent and well informed human experts to it (vide e.g. criminal law where “intention” can make all the difference between an indictment for second or first degree murder…) can be handled definitively by mechanistic computer algorithms is arguably the most scary aspect of this inane orgy of technological hubris and naivety the likes of Google are pressing onto us.

I’ve seen some collateral damage already, but pragmatic Webmasters will find –respectively have found long ago– their way to build inbound links under Google’s regime.

And here is the context of Google’s definition link exchanges = link schemes which makes clear that not each and every reciprocal link is evil:

[…] However, some webmasters engage in link exchange schemes and build partner pages exclusively for the sake of cross-linking, disregarding the quality of the links, the sources, and the long-term impact it will have on their sites. This is in violation of Google’s webmaster guidelines and can negatively impact your site’s ranking in search results. Examples of link schemes can include:

• Links intended to manipulate PageRank
• Links to web spammers or bad neighborhoods on the web
• Link exchange and reciprocal links schemes (‘Link to me and I’ll link to you.’)
• Buying or selling links […]

Again, please read the whole page.

Bear in mind that all this is Internet history, it just boiled up yesterday as the help page was discovered.

Related article: Eric Ward on reciprocal links, why they do good, and where they do bad.


NOPREVIEW - The missing X-Robots-Tag

Google provides previews of non-HTML resources listed on their SERPs:
[Screenshot: View PDF as HTML document]
These “view as text” and “view as HTML” links are pretty useful when you for example want to scan a PDF document before you clutter your machine’s RAM with 30 megs of useless digital rights management (aka Adobe Reader). You can view contents even when the corresponding application is not installed, Google’s transformed previews should not stuff your maiden box with unwanted malware, etcetera. However, under some circumstances it would make sound sense to have a NOPREVIEW X-Robots-Tag, but unfortunately Google hasn’t introduced it yet.

Google is rightfully proud of their capability to transform various file formats to readable HTML or plain text: Adobe Portable Document Format (pdf), Adobe PostScript (ps), Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku), Lotus WordPro (lwp), MacWrite (mw), Microsoft Excel (xls), Microsoft PowerPoint (ppt), Microsoft Word (doc), Microsoft Works (wks, wps, wdb), Microsoft Write (wri), Rich Text Format (rtf), Shockwave Flash (swf), of course Text (ans, txt) plus a couple of “unrecognized” file types like XML. New formats are added from time to time.

According to Adam Lasnik currently there is no way for Webmasters to tell Google not to include the “View as HTML” option. You can try to fool Google’s converters by messing up the non-HTML resource in a way that a sane parser can’t interpret it. Actually, when you search a few minutes you’ll find e.g. PDF files without the preview links on Google’s SERPs. I wouldn’t consider this attempt a bullet-proof nor future-proof tactic though, because Google is pretty intent on improving their conversion/interpretation process.

I like the previews not only because sometimes they allow me to read documents behind a login screen. That’s a loophole Google should close as soon as possible. When for example PDF documents or Excel sheets are crawlable but not viewable for searchers (at least not with the second click) that’s plain annoying both for the site as well as for the search engine user.

With HTML documents the Webmaster can apply a NOARCHIVE crawler directive to prevent non paying visitors from lurking via Google’s cached page copies. Thanks to the newish REP header tags one can do that with non-HTML resources too, but neither NOARCHIVE nor NOSNIPPET etch away the “view-as HTML” link.

<speculation>Is the lack of a NOPREVIEW crawler directive just an oversight, or is it stuck in the pipeline because Google is working on supplemental components and concepts? Google’s yet inconsistent handling of subscription content comes to mind as an ideal playground for such a robots directive in combination with a policy change.</speculation>

Anyways, there is a need for a NOPREVIEW robots tag, so why not implement it now? Thanks in advance.


Handling Google’s neat X-Robots-Tag - Sending REP header tags with PHP

It’s a bad habit to tell the bad news first, and I’m guilty of that. Yesterday I linked to Dan Crow telling Google that the unavailable_after tag is useless IMHO. So today’s post is about a great thing: REP header tags aka X-Robots-Tags, unfortunately mentioned as second news somewhat concealed in Google’s announcement.

The REP is not only a theatre, it stands for Robots Exclusion Protocol (robots.txt and robots meta tag). Everything you can shove into a robots meta tag on an HTML page can now be delivered in the HTTP header for any file type:

  • INDEX|NOINDEX - Tells whether the page may be indexed or not
  • FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided on the page or not
  • NOODP - Tells search engines not to use page titles and descriptions from the ODP on their SERPs
  • NOYDIR - Tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs
  • NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
  • NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
  • UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes a URL from Google’s search index a day after the given date/time
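As a hedged illustration of the last directive, here is a small Python sketch that builds an unavailable_after value. It assumes the day-month-year style of the 25-Aug-2007 15:00:00 EST example quoted elsewhere on this page; the function name is made up, and whether Google accepts "GMT" as the zone label in exactly this form is an assumption you should check against Google's announcement.

```python
from datetime import datetime, timezone

# Fixed English month abbreviations, so the output does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def unavailable_after(expires):
    """Build an unavailable_after directive from a timezone-aware datetime."""
    utc = expires.astimezone(timezone.utc)
    stamp = "%02d-%s-%04d %02d:%02d:%02d GMT" % (
        utc.day, MONTHS[utc.month - 1], utc.year,
        utc.hour, utc.minute, utc.second)
    return "unavailable_after: " + stamp
```

The returned string can be sent as the value of an X-Robots-Tag header, on its own or combined with other directives.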

So how can you serve X-Robots-Tags in the HTTP header of PDF files for example? Here is one possible procedure to explain the basics, just adapt it for your needs:

Rewrite all requests for PDF documents to a PHP script which knows which files must be served with REP header tags. You could do an external redirect too, but this may confuse things. Put this code in your root’s .htaccess:

RewriteEngine On
RewriteBase /pdf
RewriteRule ^(.*)\.pdf$ serve_pdf.php

In /pdf you store some PDF documents and serve_pdf.php:

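$requestUri = $_SERVER['REQUEST_URI'];

// Serve my.pdf with REP header tags
if (stristr($requestUri, 'my.pdf')) {
    header('X-Robots-Tag: index, noarchive, nosnippet', TRUE);
    header('Content-type: application/pdf', TRUE);
    readfile('my.pdf'); // assumes my.pdf sits next to this script in /pdf
}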

This setup routes all requests for *.pdf files to /pdf/serve_pdf.php which outputs something like this header when a user agent asks for /pdf/my.pdf:

Date: Tue, 31 Jul 2007 21:41:38 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
X-Powered-By: PHP/4.4.4
X-Robots-Tag: index, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: application/pdf

You can do that with all kinds of file types. Have fun and say thanks to Google :)
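For other file types the same idea boils down to a lookup from file name to directives. Below is a minimal Python sketch of that lookup, independent of the PHP setup above; the rule table, the function name and the fallback content type are assumptions for illustration only.

```python
import mimetypes

# Hypothetical per-file REP directives, keyed by lowercase file name.
REP_RULES = {
    "my.pdf": "index, noarchive, nosnippet",
}


def rep_headers(request_path, rules=REP_RULES):
    """Return the response headers to send for a requested file."""
    # Strip the query string and keep only the file name.
    filename = request_path.split("?", 1)[0].rsplit("/", 1)[-1]
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    headers = [("Content-Type", content_type)]
    directives = rules.get(filename.lower())
    if directives:
        headers.append(("X-Robots-Tag", directives))
    return headers
```

Files without an entry in the table are served with their content type only, so crawlers fall back to their default behavior for those.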


Unavailable_After is totally and utterly useless

I’ve a lot of respect for Dan Crow, but I’m struggling with my understanding, or possible support, of the unavailable_after tag. I don’t want to put my reputation for bashing such initiatives from search engines at risk, so sit back and grab your popcorn, here comes the roasting:

As a Webmaster, I did not find a single scenario where I could or even would use it. That’s because I’m a greedy traffic whore. A bazillion other Webmasters are greedy too. So how the heck is Google going to sell the newish tag to the greedy masses?

Ok, from a search engine’s perspective unavailable_after makes sound sense. Outdated pages bind resources, annoy searchers, and in a row of useless crap the next bad thing after an outdated page is intentional Webspam.

So convincing the great unwashed to put that thingy on their pages inviting friends and family to granny’s birthday party on 25-Aug-2007 15:00:00 EST would improve search quality. Not that family blog owners care about new meta tags, RFC 850-ish date formats, or search engine algos rarely understanding that the announced party is history on Aug/26/2007. Besides, there may be painful aftermaths worth submitting a desperate call for aspirins the day after in the comments, which would be news of the day after expiration. Kinda dilemma, isn’t it?

Seriously, unless CMS vendors support the new tag, tiny sites and clique blogs aren’t Google’s target audience. This initiative addresses large sites which are responsible for a huge amount of outdated contents in Google’s search index.

So what is the large site Webmaster’s advantage of using the unavailable_after tag? A loss of search engine traffic. A loss of link juice gained by the expired page. And so on. Losses of any kind are not that helpful when it comes to an overdue raise nor in salary negotiations. Hence the Webmaster asks for the sack when s/he implements Google’s traffic terminator.

Who cares about Google’s search quality problems when it leads to traffic losses? Nobody. Caring Webmasters do the right thing anyway. And they don’t need no more useless meta tags like unavailable_after. “We don’t need no stinking metas” from “Another Brick in the Wall Part Web 2.0” expresses my thoughts perfectly.

So what separates the caring Webmaster from the ‘ruthless traffic junkie’ who Google wants to implement the unavailable_after tag? The traffic junkie lets his stuff expire without telling Google about its state, is happy that frustrated searchers click the URL from the SERPs even years after the event, and enjoys the earnings from tons of ads placed above the content minutes after the party was over. Dear Google, you can’t convince this guy.

[It seems this is a post about repetitive “so whats”. And I came to the point before the 4th paragraph … wow, that’s new … and I’ve put a message in the title which is not even meant as link bait. Keep on reading.]

So what does the caring Webmaster do without the newish unavailable_after tag? Business as usual. Examples:

Say I run a news site where the free contents go to the subscription area after a while. I’d closely watch which search terms generate traffic, write a search engine optimized summary containing those keywords, put that on the sales pitch, and move the original article to the archives accessible to subscribers only. It’s not my fault that the engines think they point to the original article after the move. When they recrawl and reindex the page my traffic will increase because my summary fits their needs more perfectly.

Say I run an auction site. Unfortunately particular auctions expire, but I’m sure that the offered products will return to my site. Hence I don’t close the page, but I search my database for similar offerings and promote them under a H3 heading like “[product] (stuffed keywords) is hot” /H3 P buy [product] here: /P followed by a list of identical products for sale or similar auctions.

Say I run a poll expiring in two weeks. With Google’s newish near real time indexing that’s enough time to collect keywords from my stats, so the textual summary under the poll’s results will attract the engines as well as visitors when the poll is closed. Also, many visitors will follow the links to related respectively new polls.

From Google’s POV there’s nothing wrong with my examples, because the visitor gets what s/he was searching for, and I didn’t cheat. Now tell me, why should I give up these valuable sources of nicely targeted search engine traffic just to make Google happy? Rather I’d make my employer happy. Dear Google, you didn’t convince me.

Update: Tanner Christensen posted a remarkable comment at Sphinn:

I’m sure there is some really great potential for the tag. It’s just none of us have a need for it right now.

Take, for example, when you buy your car without a cup holder. You didn’t think you would use it. But then, one day, you find yourself driving home with three cups of fruit punch and no cup holders. Doh!

I say we wait it out for a while before we really jump on any conclusions about the tag.

John Andrews was the first to report an evil use of unavailable_after.

Also, Dan Crow from Google announced a pretty neat thing in the same post: With the X-Robots-Tag you can now apply crawler directives valid in robots meta tags to non-HTML documents like PDF files or images.


Analyzing search engine rankings by human traffic

Recently I’ve discussed ranking checkers at several places, and I’m quite astonished that folks still see some value in ranking reports. Frankly, ranking reports are –in most cases– a useless waste of paper and/or disk space. That does not mean that SERP positions per keyword phrase aren’t interesting. They’re just useless without context, that is traffic data. Converting traffic pays the bills, not sole rankings. The truth is in your traffic data.

That said, I’d like to outline a method to get a particularly useful piece of information out of raw traffic data: underestimated search terms. That’s not a new approach, and perhaps you have the reports already, but maybe you don’t look at the information which is somewhat hidden in stats ordered by success, not failure. And you should be, or employ, a programmer to implement it.

The first step is gathering data. Create a database table to record all hits, then in a footer include or so, when the complete page got outputted already, write all data you have in that table. All data means URL, timestamp, and variables like referrer, user agent, IP, language and so on. Be a data rat, log everything you can get hold of. With dynamic sites it’s easy to add page title, (product) IDs etcetera, with static sites write a tool to capture these attributes separately.

For performance reasons it makes sense to work with a raw data table, which has just a primary key, to log the requests, and normalized working tables which have lots of indexes to allow aggregations, ad hoc queries, and fast reports from different perspectives. Also think of regularly purging the raw log table, and of historization. While transferring raw log data to the working tables in low traffic hours or on another machine you can calculate interesting attributes and add data from other sources which were not available to the logging process.
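A minimal sketch of that split, using Python with SQLite purely for illustration: one raw table with nothing but a primary key for cheap inserts, one indexed working table, and a transfer step that also purges the raw log. All table, column and function names are assumptions, and a real setup would normalize the working tables further.

```python
import sqlite3

SCHEMA = """
CREATE TABLE raw_hits (
    id INTEGER PRIMARY KEY,
    url TEXT, ts TEXT, referrer TEXT, user_agent TEXT, ip TEXT, language TEXT
);
CREATE TABLE hits (
    id INTEGER PRIMARY KEY,
    url TEXT, ts TEXT, referrer TEXT,
    search_engine TEXT, search_term TEXT, serp_number REAL
);
CREATE INDEX hits_by_engine ON hits (search_engine, serp_number);
CREATE INDEX hits_by_term ON hits (search_term);
"""


def open_db(path=":memory:"):
    """Create the raw log table and the indexed working table."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def log_hit(db, url, ts, referrer, user_agent, ip, language):
    """Called at the end of each page view: a plain insert, no lookups."""
    db.execute(
        "INSERT INTO raw_hits (url, ts, referrer, user_agent, ip, language) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (url, ts, referrer, user_agent, ip, language))
    db.commit()


def transfer(db):
    """Copy raw rows into the working table, then purge the raw log."""
    rows = db.execute(
        "SELECT url, ts, referrer FROM raw_hits ORDER BY id").fetchall()
    db.executemany("INSERT INTO hits (url, ts, referrer) VALUES (?, ?, ?)", rows)
    db.execute("DELETE FROM raw_hits")
    db.commit()
    return len(rows)
```

In this sketch the derived columns (search engine, search term, SERP number) stay empty; the transfer step is the natural place to calculate them from the referrer.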

You’ll need that traffic data collector anyway for a gazillion of purposes where your analytics software fails, is not precise enough, or just can’t deliver a particular evaluation perspective. It’s a prerequisite for the method discussed here, but don’t build a monster sized cannon to chase a fly. You can gather search engine referrer data from logfiles too.

For example, an interesting piece of information is on which SERP a user clicked a link pointing to your site. Simplified, you need three attributes in your working tables to store this info: search engine, search term, and SERP number. You can extract these values from the HTTP_REFERER.

1. “google” in the server name tells you the search engine.
2. The “q” variable’s value tells you the search term “keyword1 keyword2”.
3. The lack of a “start” variable tells you that the result was placed on the first SERP. The lack of a “num” variable lets you assume that the user got 10 results per SERP, so it’s quite safe to say that you rank in the top 10 for this term. Actually, the number of results per page is not always extractable from the URL because it’s pulled from a cookie usually, but not so many surfers change their preferences (e.g. less than 0.5% surf with 100 results according to JohnMu and my data as well). If you’ve got a “num” value then add 1 and divide the result by 10 to make the data comparable. If that’s not precise enough you’ll spot it afterwards, and you can always recalculate SERP numbers from the canned referrer.

1. and 2. as above.
3. The “start” variable’s value 10 tells you that you got a hit from the second SERP. When start=10 and there is no “num” variable, most probably the searcher got 10 results per page.

1. and 2. as above.
3. The empty “startIndex” variable and startPage=1 are useless, but the lack of “start” and “num” tells you that you’ve got a hit from the first Spanish SERP.

1. and 2. as above.
3. num=20 tells you that the searcher views 20 results per page, and start=20 indicates the second SERP, so you rank between #21 and #40, thus the (averaged) SERP# is 3.5 (provided SERP# is not an integer in your database).

You get the idea; here is a cheat sheet and official documentation on Google’s URL parameters. Analyze the URLs in your referrer logs and call them with cookies off, which disables your personal search preferences, then play with the values. Do that with other search engines too.
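The extraction rules above can be sketched in a few lines of Python. This is a rough illustration, not a complete referrer parser. The function name parse_search_referrer, the sample URL in the usage note, and the averaging rule for the SERP number are my assumptions. The averaging maps the ranking range onto pages of 10 results and takes the mean of the first and last page, which reproduces the 3.5 from the num=20, start=20 example.

```python
from urllib.parse import urlparse, parse_qs


def _to_int(value, default):
    """Convert a query string value to int, falling back to a default."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def parse_search_referrer(referrer, default_num=10):
    """Return (search engine, search term, SERP number) or None.

    Only Google is handled here; the SERP number is normalized
    to pages of 10 results and may be fractional.
    """
    parts = urlparse(referrer)
    host = parts.hostname or ""
    if "google" not in host:  # simplistic check, as in rule 1 above
        return None
    # parse_qs drops empty values such as "startIndex=" by default.
    params = parse_qs(parts.query)
    term = params.get("q", [""])[0]
    if not term:
        return None
    start = max(_to_int(params.get("start", ["0"])[0], 0), 0)
    num = _to_int(params.get("num", [""])[0], default_num)
    if num <= 0:
        num = default_num
    # The hit ranks somewhere between start+1 and start+num.
    first_page = start // 10 + 1
    last_page = (start + num - 1) // 10 + 1
    return "google", term, (first_page + last_page) / 2
```

Calling parse_search_referrer() with a referrer like http://www.google.com/search?q=keyword1+keyword2&num=20&start=20 yields the engine, the decoded search term, and the averaged SERP number. Other engines need their own parameter names.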

Now a subset of your traffic data has a value in “search engine”. Aggregate tuples where search engine is not NULL, then select the results where, for example, the SERP number is lower than or equal to 3.99 (or 4, respectively), ordered by SERP number ascending, hits descending and keyword phrase, with a break by search engine. (Why sorted by traffic descending? You have a report of your best performing keywords already.)
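Here is how such a report could look as a query, again assuming SQLite and Python for illustration only. The table search_hits, its columns, and the function serp_report() are hypothetical names. The break by search engine is approximated by sorting on the engine first.

```python
import sqlite3

# One row per hit; the three attributes come from the referrer.
SEARCH_HITS_SCHEMA = """
CREATE TABLE search_hits (
    search_engine TEXT,
    search_term TEXT,
    serp_number REAL
);
"""

REPORT_SQL = """
SELECT search_engine, search_term, serp_number, COUNT(*) AS hits
FROM search_hits
WHERE search_engine IS NOT NULL
  AND serp_number <= :max_serp
GROUP BY search_engine, search_term, serp_number
ORDER BY search_engine, serp_number ASC, hits DESC, search_term
"""


def serp_report(conn, max_serp=4.0):
    """Return (engine, term, SERP number, hits) rows for the first SERPs."""
    return conn.execute(REPORT_SQL, {"max_serp": max_serp}).fetchall()
```

Rows without a search engine and terms ranking beyond the threshold drop out. What remains is grouped per engine, lowest SERP number first.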

The result is a list of search terms you rank for on the first 4 SERPs, beginning with keywords you’ve probably not optimized for. At least you didn’t optimize the snippet to improve CTR, so your ranking doesn’t generate a reasonable amount of traffic. Before you study the report, throw away your site owner hat and try to think like a consumer. Consumers sometimes use a vocabulary you didn’t think of before.

Research promising keywords, and decide whether you want to push, bury or ignore them. Why bury? Well, in some cases you just don’t want to rank for a particular search term, [your product sucks] being just one example. If the ranking is fine, the search term smells somewhat lucrative, and just the snippet sucks in a particular search query’s context, enhance your SERP listing.

Every once in a while you’ll discover a search term making a killing for your competitors whilst you never spotted it because your stats package reports only the best 500 monthly referrers or so. Also, you’ll get the most out of your rankings by optimizing their SERP CTRs.

Be creative. Over time your traffic database becomes more and more valuable, allowing other unconventional and/or site specific reports which off-the-shelf analytics software usually does not deliver. Most probably your competitors use standard analytics software, so individually developed algos and reports can make a difference. That does not mean you should throw away your analytics software to reinvent the wheel. However, once you’re used to self developed analytic tools you’ll think of more interesting methods, not only to analyse and monitor rankings by human traffic, than you can implement in this century ;)

Bear in mind that the method outlined above does not and cannot replace serious keyword research.

Another –very popular– approach to get this info would be automated ranking checks mashed up with hits by keyword phrase. Unfortunately, Google and other engines do not permit automated queries for the purpose of ranking checks, and this method works with preselected keywords, which means you won’t find (all) search terms created by users. Even when you compile your ranking checker’s keyword lists via various keyword research tools, you’ll still miss out on some interesting keywords in your seed list.

Related thoughts: Why regular and automated ranking checks are necessary when you operate seasonal sites by Donna


Rediscover Google’s free ranking checker!

Nowadays we’re searching via toolbar, personalized homepage, or in the browser address bar by typing in “google” to get the search box, typing in a search query using “I feel lucky” functionality, or -my favorite- typing in

Old fashioned, uncluttered and nevertheless sexy user interfaces are forgotten, and pretty much disliked due to the lack of nifty rounded corners. Luckily Google still maintains them. Look at this beautiful SERP:
[Image: Google's free ranking checker]
It’s free of personalized search, wonderfully uncluttered because the snippets appear as tooltips only, results are nicely numbered from 1 to 1,000 on just 10 awesomely fast loading pages, and when I’ve visited my URLs before I spot my purple rankings quickly. This interface is an ideal free ranking checker. It supports &filter=0 and other URL parameters, so it’s a perfect tool when I need to look up particular search terms.

Mass ranking checks are totally and utterly useless, at least for the average site, and penalized by Google. Well, I can think of ways to semi-automate a couple queries, but honestly, I almost never need that. Providing fully automated ranking reports to clients gave SEO services a more or less well deserved snake oil reputation, because nice rankings for preselected keywords may be great ego food, but they don’t pay the bills. I admit that with some setups automated mass ranking checks make sense, but those are off-topic here.

By the way, Google’s query stats are a pretty useful resource too.


Blogger to rule search engine visibility?

Via Google’s Webmaster Forum I found this curiosity:

User-agent: *
Disallow: /search
Disallow: /

A standard robots.txt at * looks different:

User-agent: *
Disallow: /search
Sitemap: http://*
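To see what the extra “Disallow: /” line does to a well behaved crawler, both files can be fed to the robots.txt parser in Python’s standard library. This is just a quick check: the host example.blogspot.com and both test URLs are made up, and the Sitemap line is left out because its URL is not given above.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt found on the blog in question.
BLOCKED = """User-agent: *
Disallow: /search
Disallow: /
"""

# The standard variant, without the Sitemap line.
STANDARD = """User-agent: *
Disallow: /search
"""

# Hypothetical URLs on a hypothetical blog.
POST_URL = "http://example.blogspot.com/2007/07/some-post.html"
SEARCH_URL = "http://example.blogspot.com/search/label/seo"


def crawlable(robots_txt, url, user_agent="Googlebot"):
    """Tell whether the given robots.txt lets the user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the standard file only the /search URLs are off limits. With the additional “Disallow: /” every URL of the blog is blocked for every crawler that obeys the “*” section.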

According to the blogger, the blog is not private, which would have explained the crawler blocking:

It is a public blog. In the past it had a standard robots.txt, but 10 days ago it changed to “Disallow: /”

Copyscape thinks that the blog in question shares a fair amount of content with other Web pages. So does blog search:
has a duplicate, posted by the same author, at,
is reprinted at
and so on. Probably a further investigation would reveal more duplicated contents.

It’s understandable that Blogger is not interested in wasting Google’s resources by letting Ms. Googlebot crawl the same contents from different sources. But why do they block other search engines too? And why do they block the source (the posts reprinted at state “Originally posted at [blogspot URL]”)?

Is this really censorship, or just a software glitch, or is it all the blogger’s fault?

Update 07/26/2007: The robots.txt reverted to standard contents for unknown reasons. However, with a shabby link neighborhood as expressed in the blog’s footer I doubt the crawlers will enjoy their visits. At least the indexers will consider this sort of spider fodder nauseous.

