Mozilla-Googlebot Helps with Debugging

Tracking Googlebot-Mozilla is a great way to discover bugs in a Web site. Try it for yourself: filter your logs by her user agent name:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Although Googlebot-Mozilla can add pages to the index, I see her mostly digging in ‘fishy’ areas. For example, she explores URLs where I redirect spiders to a page without a query string to avoid indexing of duplicate content. She is very interested in pages carrying a robots NOINDEX,FOLLOW tag when she knows another page with the same content, available from a similar URL but stating INDEX,FOLLOW. She goes after unusual query strings like ‘var=val&&&&’ resulting from a script bug fixed months ago but still represented by probably thousands of useless URLs in Google’s index. She fetches a page under two different query strings, checking for duplicate content and alerting me to a superfluous input variable used in links on a forgotten page. She fetches dead links to read my very informative error page … and her best friend is the AdSense bot, since they seem to share IPs as well as an interest in page updates before Googlebot is aware of them.
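
For anyone who wants to replay this at home, here is a minimal log-filtering sketch in Python. It assumes a combined-format access log named access.log; the file name and the regular expression are illustrative, not part of any tool mentioned here.

    import re

    # The Mozilla flavour of Googlebot announces itself with this user agent fragment
    MOZILLA_GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1"

    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if MOZILLA_GOOGLEBOT in line:
                # Extract the requested path from the "METHOD /path HTTP/1.x" part
                match = re.search(r'"[A-Z]+ (\S+) HTTP/', line)
                if match:
                    print(match.group(1))

Sorting and counting the output quickly shows which ‘fishy’ areas she is poking around in.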


Googlebots go Fishing with Sitemaps

I’ve used Google Sitemaps since it was launched in June. Six weeks later I say ‘Kudos to Google’, because it works even better than expected. Making use of Google Sitemaps is definitely a must, at least for established Web sites (it doesn’t help much with new sites).

From my logs I found some patterns; here is how the Googlebot sisters go fishing:
· Googlebot-Mozilla downloads the sitemaps 6 times per day: 2 fetches every 8 hours, like clockwork (or every 12 hours lately, now up to 4 fetches within a few minutes from the same IP address). Since this behavior is not documented, I recommend implementing automated resubmit-pings anyway (a minimal ping sketch follows below this list).
· Googlebot fetches new and updated pages harvested from the sitemap at the latest 2 days after their inclusion in the XML file, or after a current last-modified value is provided. Time to index is consistently 2 days at most. There is just one fetch per page (as long as the sitemap doesn’t submit another update), resulting in complete indexing (title, snippet, and cached page). Sometimes she ‘forgets’ a sitemap-submitted URL but fetches it later by following links (this happens with very similar new URLs, especially when they differ only in a query string value). She crawls and indexes even new orphans (pages not linked from anywhere).
· Googlebot-Mozilla acts as a weasel in Googlebot’s backwash and is suspected of revealing her secrets to AdSense.
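
A resubmit-ping is just an HTTP GET against Google’s sitemap ping address with the sitemap’s location as a parameter. The Python sketch below assumes the ping endpoint documented for the Google Sitemaps program and an illustrative sitemap URL; double-check both before wiring it into a cron job.

    from urllib.parse import quote
    from urllib.request import urlopen

    # Illustrative sitemap location; replace with your own
    SITEMAP_URL = "http://www.example.com/sitemap.xml"

    # Ping endpoint as documented for Google Sitemaps; verify it before relying on it
    PING = ("http://www.google.com/webmasters/sitemaps/ping?sitemap="
            + quote(SITEMAP_URL, safe=""))

    with urlopen(PING) as response:
        # A 200 response means the resubmit request was accepted
        print(response.status, response.reason)

Calling this right after the sitemap is regenerated takes the guesswork out of Google’s undocumented fetch schedule.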


Is Google Sitemaps an Index Wiper?

A few weeks after the launch of Google Sitemaps, some unlucky site owners participating in Google’s new service find their pages wiped from Google’s index.

As for small and new sites, that’s pretty common and has nothing to do with Google Sitemaps. When it comes to established Web sites disappearing partly or even completely, that’s another story.

There is an underestimated and often overlooked risk attached to sitemap submissions. Webmasters who have used sitemap tools that generate the sitemap from the web server’s file system may have submitted ‘unintended spider food’ to Google, and quickly triggered a spam filter.

At least for important sites, it’s a must to double-check generated sitemaps before they get submitted. Sitemap generators may dig out unlinked and outdated junk, for example spider-trap link pages and doorway pages from the last century.
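
One way to double-check a generated sitemap before submitting it: parse the file, eyeball the list of URLs, and flag anything that no longer responds. A rough Python sketch, with an illustrative file name and deliberately simple checks:

    import xml.etree.ElementTree as ET
    from urllib.error import HTTPError, URLError
    from urllib.request import Request, urlopen

    # Collect every <loc> entry, regardless of the sitemap namespace version
    tree = ET.parse("sitemap.xml")
    urls = [el.text.strip() for el in tree.iter() if el.tag.endswith("loc") and el.text]

    for url in urls:
        try:
            # A HEAD request is enough to spot dead or forgotten junk pages
            status = urlopen(Request(url, method="HEAD")).status
        except HTTPError as err:
            status = err.code
        except URLError:
            status = None
        print(status, url)

Anything that is not a plain 200, or that you don’t recognize as a page you actually want indexed, deserves a second look before the file goes to Google.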

More info here.


Optimizing the Number of Words per Page

I’m just in the mood to reveal a search engine optimization secret that works fine for Google and other major search engines as well.

The optimal number of words per page is determined as follows:

  • Resize your browser window to fit 800*600 resolution.
  • Create a page with three columns below the page title in an H1 tag. In the first column put your menu. In the third column put an AdSense ‘Wide Skyscraper’ ad (160*600).
  • Write your copy and paste it into the second column. Reload the page. If the ads match your topic, fine. If not, rewrite your text, or add more, or remove fillers.
  • You have written too many words if the middle column exceeds the height of the right ad.
  • Link to this page and leave it alone until Googlebot has fetched it 2-4 times. Time to index is 2 days, so reload the page 2 days after the last Googlebot visit to check whether the ads still match your content (or rather the keyword phrase you’re after). If not, tweak your wording.
  • Then re-arrange the ads and move on. You’ve achieved the optimal number of words per page for your keyword phrase.

Actually, the gibberish above is a parody of AdSense-optimized content sites.

Seriously, loooong copy with fair link popularity attracts way more search engine traffic, especially if it is supported by a few tiny pages naturally optimized for particular phrases, for example footnote pages pointing out details, or one-page definitions of particular terms used in the long copy. This structure is comfortable for all users, experts on the topic and interested newbies alike, so search engines honor it. There is no such thing as an optimal number of words per page.


Spam Detection is Unethical

While releasing a Googlebot Spoofer that lets my clients check how their browser-optimized pages respond to search engine crawlers, I was wondering again why major search engines tolerate hardcore cloaking to such a great degree. I can handle my clients’ competition flooding the engines with zillions of doorway pages and the like, so there are no emotions involved here. I just cannot understand why the engines don’t enforce compliance with their guidelines. That’s beyond any logic, thus I’m speculating:
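
The idea behind such a spoofer is trivial; the Python sketch below is not the tool mentioned above, just a minimal illustration of the principle: request the same URL once with a crawler user agent and once with a browser user agent, then compare the responses. Keep in mind that hardcore cloakers also key on IP addresses, which a user-agent check alone cannot expose.

    from urllib.request import Request, urlopen

    URL = "http://www.example.com/"  # illustrative target

    GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    BROWSER_UA = "Mozilla/5.0 (Windows NT 5.1)"  # any ordinary browser user agent will do

    def fetch(user_agent):
        request = Request(URL, headers={"User-Agent": user_agent})
        with urlopen(request) as response:
            return response.read()

    as_bot = fetch(GOOGLEBOT_UA)
    as_browser = fetch(BROWSER_UA)

    # Identical bodies: no user-agent cloaking on this URL (IP-based cloaking may still hide)
    if as_bot == as_browser:
        print("identical responses")
    else:
        print("different responses: %d vs. %d bytes" % (len(as_bot), len(as_browser)))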

They don’t care. If they went after spamdexing, they would lose a few billion indexed pages. That would be a very bad PR effect, absolutely unacceptable.

They have other priorities. Focusing on local search, they figure the problem will solve itself, because it’s not very likely that a spammer resides close to the search engine user who is seeking a pizza service and lands in a PPC popup hell. Just claim it ain’t broke, so why fix it?

They believe spam detection is unethical. ‘Don’t be evil’ can be interpreted as ‘You may cheat us using black hat methods; we won’t strike back with a taste of your own medicine’. Hey, this interpretation makes sense! Translation for non-geeks: ‘Spoofing is as evil as cloaking, cloaking cannot be discovered without spoofing, and since we aren’t evil, we encourage you to cloak’.

Great. Tomorrow I’ll paint my white hat black and add a billion or more sneaky pages to everyone’s index.

Seriously, I’ve a strong gut feeling that all of the above will belong to the past pretty soon. The engines, having changed their crawlers’ user agent names to ‘Mozilla…’, could learn to render their spider food and to pull it from unexpected IP addresses. With all respect to successful black hat SEOs, I believe that white hat search engine optimization is a good business decision, probably even in competitive industries over the long haul.


Green Tranquilizes

Every 3-4 months the publishing Internet community celebrates a spectacular event that savvy webmasters and search engine experts widely ignore: Google updates its Toolbar-PageRank!

Site owners around the globe hectically visit all their pages to view the magic green presented by Google’s Toolbar. If the green bar has grown by a pixel or even two, they hurry to the nearest webmaster board to praise their own genius. Otherwise they post ‘Google is broke’.

Once the Toolbar-PR update has settled, folks in fear of undefined penalties from the almighty Google check all their outgoing links, removing everything with a PR of less than 4/10. Next they add a REL=NOFOLLOW attribute to every internal link whose target shows a white or gray bar on the infallible toolbar. Trembling, they hide in the ‘linking is evil’ corner of the world-wide-waste-of-common-sense universe for another 3-4 months.

Where is the call for rationality that leads those misguided lemmings back to mother earth? Hey folks, Toolbar-PR is just for fun; it means next to nothing. Green tranquilizes, but white or gray is no reason to panic. The funny stuff Google shoves into your toolbar is an outdated snapshot with no correlation to current rankings or even to real PageRank. It is by no means an indicator of how valuable a page really is, so please LOL (link out loud) if a page seems to provide value for your visitors.

As a matter of fact, all of the above will change nothing. Green-bar fetishists don’t even listen to GoogleGuy posting ‘This is just plain old normal toolbar PageRank’.


Startup

Since every dog and his grandpa blogs boring low-life details, I guess I have to follow the trend. Let’s see whether I can find somewhat interesting topics or not.



