Archived posts from the 'Google' Category

Awesome: Ms. Googlebot Provides Reports

Before Yahoo’s Site Explorer even goes live, Google provides advanced statistics in its Sitemaps program. Ms. Googlebot now tells the webmaster which spider food she refused to eat, and why. The ‘lack of detailed stats’ has produced hundreds of confused posts in the Google Sitemaps Group so far.

When Google Sitemaps was announced on June 2, 2005, Shiva Shivakumar stated: “We are starting with some basic reporting, showing the last time you’ve submitted a Sitemap and when we last fetched it. We hope to enhance reporting over time, as we understand what the webmasters will benefit from.” Google’s Sitemaps team closely monitored the issues and questions brought up by webmasters, and since August 30, 2005 there are enhanced stats. Here is how it works.

Google’s crawler reports provide information on URIs spidered from sitemaps as well as URIs found during regular crawls by following links, regardless of whether a URI is listed in a sitemap or not. Ms. Googlebot’s error reports are accessible only to a site’s webmasters, after a more or less painless verification of ownership. They contain all sorts of errors, for example dead links, conflicts with exclusions in the robots.txt file, and even connectivity problems.

Google’s crawler report is a great tool, kudos to the sitemaps team!

More good news from the Sitemaps Blog:
Separate sitemaps for mobile content to enhance a site’s visibility in Google’s mobile search.


Serious Disadvantages of Selling Links

There is a pretty interesting discussion about search engine spam going on at O’Reilly Radar. The topic title is somewhat misleading; the actual subject is passing PageRank™ via paid ads on popular sites. Read the whole thread, lots of sound folks express their valuable and often fascinating opinions.

My personal statement is a plain “Don’t sell links for passing PageRank™. Never. Period.”, but the intention of ad space purchases isn’t always that clear. If an ad isn’t related to my content, I tend to put client-side affiliate links on my sites, because for a long time search engine spiders didn’t follow them. Well, it’s not that easy any more.

However, Matt Cutts ‘revealed’ an interesting fact in the thread linked above. Google indeed applies nofollow logic to Web sites selling (at least unrelated) ads:

… [Since September 2003] …parts of perl.com, xml.com, etc. have not been trusted in terms of linkage … . Remember that just because a site shows up for a “link:” command on Google does not mean that it passes PageRank, reputation, or anchortext.

This policy wasn’t really a secret before Matt’s post, because a critical mass of high-PR links not passing PR draws a sharp picture. What many site owners selling links in ads have obviously never considered is the collateral damage with regard to on-site optimization. If Google distrusts a site’s linkage, its outbound and internal links have no power. That is, the optimization efforts on navigational links, article interlinking etc. are pretty much useless on a site selling links. Internal links not passing relevancy via anchor text is probably worse than the PR loss, because clever SEOs always acquire deep inbound links.

Rescue strategy:

1. Implement the change recommended by Matt Cutts (a minimal sketch of such a rewrite follows this list):

Google’s view on this is … selling links muddies the quality of the web and makes it harder for many search engines (not just Google) to return relevant results. The rel=nofollow attribute is the correct answer: any site can sell links, but a search engine will be able to tell that the source site is not vouching for the destination page.

2. Write to Google (possibly cc’ing a spam report and a reinclusion request) that you’ve changed the linkage of your ads.

3. Hope and pray, on failure goto 2.
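
For step 1, here is a minimal sketch of what such a rewrite could look like, assuming the ads are plain HTML links and the paid-link targets can be recognized by hostname. AD_HOSTS and the regex are made-up examples for illustration only; on a real site you would rather fix the ad templates or use a proper HTML parser:

import re

AD_HOSTS = ("ads.example.com", "sponsor.example.net")  # hypothetical ad domains

def nofollow_paid_links(html):
    def rewrite(match):
        tag = match.group(0)
        href = match.group("href")
        if any(host in href for host in AD_HOSTS) and "nofollow" not in tag:
            # inject rel="nofollow" right after the opening "<a"
            return '<a rel="nofollow"' + tag[2:]
        return tag
    # naive pattern, for illustration only
    return re.sub(r'<a\b[^>]*\bhref="(?P<href>[^"]*)"[^>]*>', rewrite, html)

# '<a href="http://ads.example.com/buy">Sponsor</a>' becomes
# '<a rel="nofollow" href="http://ads.example.com/buy">Sponsor</a>'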


Overlooked Duplicated Content Vanishing from Google’s Index

Does Google systematically wipe out duplicated content? If so, does it affect partial dupes too? Will Google apply site-wide ‘scraper penalties’ when a particular dupe threshold gets reached or exceeded?

Following many ‘vanished page’ posts with links on message boards and usenet groups, and monitoring sites I control, I’ve found that there is indeed a pattern. It seems that Google is actively wiping out dupes. Those get deleted or stay indexed as ‘URL only’; they are not moved to the supplemental index.

Example: I have a script listing all sorts of widgets pulled from a database, where users can choose how many items they want to see per page (the values for widgets per page are hard coded and all linked), combined with prev/next-page links. This kind of dynamic navigation produces tons of partial dupes (content overlaps with other versions of the same page). Google has indexed way too many permutations of that poorly coded page, and foolishly I didn’t take care of it. Recently I got alerted when Googlebot-Mozilla requested hundreds of versions of this page within a few hours. I quickly changed the script, adding a robots NOINDEX meta tag when the content overlaps, but probably too late. Many of the formerly indexed URLs (cached, appearing with title and snippets on the SERPs) have vanished or became URL-only listings. I expect that I’ll lose a lot of ‘unique’ listings too, because I changed the script in the middle of the crawl.

I’m posting this before I have solid data to back up a finding, because it is a pretty common scenario. This kind of navigation is used on online shops, article sites, forums, SERPs … and it applies to aggregated syndicated content too.

I’ve asked Google whether they have a particular recommendation, but no answer yet. Here is my ‘fix’:

Define a straight path through the dynamic content, where not a single displayed entry overlaps with another page. For example, if your default value for items per page is 10, the straight path would be:
start=1&items=10
start=11&items=10
start=21&items=10

Then check the query string before you output the page. If it is part of the straight path, put an INDEX,FOLLOW robots meta tag on the page, otherwise (e.g. start=16&items=15) put NOINDEX.
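
A minimal sketch of that check, assuming the script receives the paging parameters start and items as in the example above and the default page size is 10 (the parameter names and the NOINDEX,FOLLOW choice are just an illustration):

DEFAULT_ITEMS = 10  # the site's default page size

def robots_meta(start, items):
    # on the straight path, pages use the default size and start at 1, 11, 21, ...
    # so no listing overlaps with another indexable page
    on_straight_path = items == DEFAULT_ITEMS and (start - 1) % DEFAULT_ITEMS == 0
    content = "INDEX,FOLLOW" if on_straight_path else "NOINDEX,FOLLOW"
    return '<meta name="robots" content="%s">' % content

# start=11&items=10 -> INDEX,FOLLOW; start=16&items=15 -> NOINDEX,FOLLOW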

I don’t know whether this method can help with shops using descriptions pulled from a vendor’s data feed, but I doubt it. If Google can determine and suppress partial dupes within a site, it can do that with text snippets from other sites too. One question remains: how does Google identify the source?


Bait Googlebot With RSS Feeds

Seeing Ms. Googlebot’s sister running wild on RSS feeds, I’m going to assume that RSS feeds may become a valuable tool to support Google’s fresh and deep crawls. Test it for yourself:

Create an RSS feed with a few unlinked or seldom-spidered pages which are not included in your XML sitemap. Add the feed to your personalized Google Home Page (‘Add Content’ -> ‘Create Section’ -> Enter Feed URL). Track spider accesses to the feed as well as to the included pages. Most probably Googlebot will request your feed more often than Yahoo’s FeedSeeker and similar bots do. Chances are that Googlebot-Mozilla is nosy enough to crawl at least some of the pages linked in the feed.

That does not help a lot with regard to indexing and ranking, but it seems to be a neat procedure to help the Googlebot sisters spot fresh content. In real life, add the pages to your XML sitemap, link to them, and acquire inbound links…

To test the waters, I’ve added RSS generation to my Simple Google Sitemaps Generator. This tool reads a plain page list from a text file and generates a dynamic XML sitemap, an RSS 2.0 site feed, and a hierarchical HTML site map.
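
For illustration, a stripped-down sketch of the idea (not the actual generator): read a plain URL list from a text file and emit an RSS 2.0 feed that can be added to a personalized Google Home Page as described above. The file name and feed title are placeholders:

from email.utils import formatdate

def rss_from_url_list(path, feed_title="Site feed"):
    with open(path) as handle:
        urls = [line.strip() for line in handle if line.strip()]
    items = "\n".join(
        "    <item><title>%s</title><link>%s</link><pubDate>%s</pubDate></item>"
        % (url, url, formatdate())  # the URL doubles as a stand-in title
        for url in urls
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<rss version="2.0">\n  <channel>\n'
            '    <title>%s</title>\n%s\n  </channel>\n</rss>' % (feed_title, items))

# open("rss.xml", "w").write(rss_from_url_list("pagelist.txt"))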

Related article on Google’s RSS endeavors: Why Google is an RSS laggard


Fight shy of the Google-Update-Hysteria

A post by Aaron Wall about his interview with Dan Thies pointed me to a great paper: How to prosper with the new Google.

I’ve just scanned it, but I’ll print it out and read it in the bath tub later on. That means a lot, because the bath tub is my holy place where I’m safe from crying kids and beeping computers as well.

It seems to me that these 17 pages packed with outstanding analyses and good advice make all the monster threads on Google updates obsolete. Actually, those threads are obsolete by themselves, because posting in them requires an IQ way below a toast, but that’s another story.


Mozilla-Googlebot Helps with Debugging

Tracking Googlebot-Mozilla is a great way to discover bugs in a Web site. Try it for yourself: filter your logs by her user agent name:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
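
A quick way to do that, assuming a common/combined log format with the user agent string somewhere on each line (the log file name is a placeholder; a simple grep works just as well):

UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

with open("access.log") as log:
    for line in log:
        if UA in line:
            print(line.rstrip())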

Although Googlebot-Mozilla can add pages to the index, I see her mostly digging in ‘fishy’ areas. For example, she explores URLs where I redirect spiders to a page without a query string to avoid indexing of duplicate content. She is very interested in pages with a robots NOINDEX,FOLLOW tag when she knows another page carrying the same content, available from a similar URL but stating INDEX,FOLLOW. She goes after unusual query strings like ‘var=val&&&&’ resulting from a script bug fixed months ago, but still represented by probably thousands of useless URLs in Google’s index. She fetches a page using two different query strings, checking for duplicate content and alerting me to a superfluous input variable used in links on a forgotten page. She fetches dead links to read my very informative error page … and her best friend is the AdSense bot, since they seem to share IPs as well as an interest in page updates before Googlebot is aware of them.


Googlebots go Fishing with Sitemaps

I’ve used Google Sitemaps since it was launched in June. Six weeks later I say ‘Kudos to Google’, because it works even better than expected. Making use of Google Sitemaps is definitely a must, at least for established Web sites (it doesn’t help much with new sites).

From my logs I found some patterns; here is how the Googlebot sisters go fishing:
· Googlebot-Mozilla downloads the sitemaps 6 times per day, 2 fetches every 8 hours like clockwork (or every 12 hours lately, now up to 4 fetches within a few minutes from the same IP address). Since this behavior is not documented, I recommend implementing automated resubmit pings anyway (see the sketch after this list).
· Googlebot fetches new and updated pages harvested from the sitemap at the latest 2 days after inclusion in the XML file, or after a current last-modified value is provided. Time to index is consistently 2 days at most. There is just one fetch per page (as long as the sitemap doesn’t submit another update), resulting in complete indexing (title, snippets, and cached page). Sometimes she ‘forgets’ a sitemap-submitted URL, but fetches it later by following links (this happens with very similar new URLs, especially when they differ only in a query string value). She crawls and indexes even (new) orphans (pages not linked from anywhere).
· Googlebot-Mozilla acts as a weasel in Googlebot’s backwash and is suspected to reveal her secrets to AdSense.
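
A minimal sketch of such a resubmit ping, assuming Google's documented sitemap ping endpoint; the sitemap URL is a placeholder, and the call belongs in whatever job regenerates your sitemap (e.g. a nightly cron):

import urllib.parse, urllib.request

PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping?sitemap="

def ping_google(sitemap_url):
    # notify Google that the sitemap changed; returns the HTTP status code
    url = PING_ENDPOINT + urllib.parse.quote(sitemap_url, safe="")
    with urllib.request.urlopen(url) as response:
        return response.status

# ping_google("http://www.example.com/sitemap.xml")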


Is Google Sitemaps an Index Wiper?

A few weeks after the launch of Google Sitemaps, some unlucky site owners participating in Google’s new service find their pages wiped out of Google’s index.

As for small and new sites, that’s pretty common and has nothing to do with Google Sitemaps. When it comes to established Web sites disappearing partly or even completely, that’s another story.

There is an underestimated and often overlooked risk attached to sitemap submissions. Webmasters who have made use of sitemap tools that generate the sitemap from the web server’s file system may have submitted ‘unintended spider food’ to Google, and quickly triggered a spam filter.

At least with important sites, it’s a must to double-check the generated sitemaps before they get submitted. Sitemap generators may dig out unlinked and outdated junk, for example spider trap link pages and doorway pages from the last century.
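
A minimal sketch of such a double-check: list every URL the generated sitemap would submit, so orphaned junk can be spotted (or diffed against the pages you actually link to) before the file goes to Google. The file name is a placeholder, and the namespace handling is deliberately loose:

import xml.etree.ElementTree as ET

def sitemap_urls(path):
    # yield all <loc> values, regardless of the sitemap namespace
    for _, element in ET.iterparse(path):
        if element.tag == "loc" or element.tag.endswith("}loc"):
            yield element.text.strip()

for url in sitemap_urls("sitemap.xml"):
    print(url)  # review this list before submitting the sitemap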

More info here.


Green Tranquilizes

Widely ignored by savvy webmasters and search engine experts, every 3-4 months the publishing Internet community celebrates a spectacular event: Google updates its Toolbar-PageRank!

Site owners around the globe hectically visit all their pages to view the magic green presented by Google’s Toolbar. If the green bar has grown by a pixel or even two, they hurry to the next webmaster board praising their genius. Otherwise they post ‘Google is broken’.

Once the Toolbar-PR update has settled, folks in fear of undefined penalties by the almighty Google check all their outgoing links, removing everything with a PR less than 4/10. Next they add a REL=NOFOLLOW attribute to all their internal links where the link target shows a white or gray bar on the infallible toolbar. Trembling, they hide in the ‘linking is evil’ corner of the world-wide-waste-of-common-sense universe for another 3-4 months.

Where is the call for rationality leading those misguided lemmings back to mother earth? Hey folks, Toolbar-PR is just fun, it means next to nothing. Green tranquilizes, but white or gray is no reason to panic. The funny stuff Google shoves into your toolbar is an outdated snapshot without correlation to current rankings or even real PageRank. It is by no means an indicator of how valuable a page is in reality, so please LOL (link out loud) if a page seems to provide value for your visitors.

As a matter of fact, all of the above will change nothing. Green bar fetishists don’t even listen to GoogleGuy posting “This is just plain old normal toolbar PageRank”.

