Overlooked Duplicated Content Vanishing from Google’s Index

Does Google systematically wipe out duplicated content? If so, does it affect partial dupes too? Will Google apply site-wide 'scraper penalties' when a particular dupe threshold is reached or exceeded?

Following many 'vanished page' posts with links on message boards and Usenet groups, and monitoring sites I control, I've found that there is indeed a pattern. It seems that Google is actively wiping out dupes: those pages get deleted or stay indexed as 'URL only', not moved to the supplemental index.

Example: I have a script listing all sorts of widgets pulled from a database, where users can choose how many items they want to see per page (the values for widgets per page are hard coded and all linked), combined with prev/next-page links. This kind of dynamic navigation produces tons of partial dupes (content overlaps with other versions of the same page). Google has indexed way too many permutations of that poorly coded page, and foolishly I didn't take care of it. Recently I was alerted when Googlebot-Mozilla requested hundreds of versions of this page within a few hours. I quickly changed the script to emit a robots NOINDEX meta tag when the content overlaps, but probably too late. Many of the formerly indexed URLs (cached, appearing with title and snippet on the SERPs) have vanished or were reduced to URL-only listings. I expect that I'll lose a lot of 'unique' listings too, because I changed the script in the middle of the crawl.

I'm posting this before I have solid data to back up a finding, because it is a pretty common scenario. This kind of navigation is used by online shops, article sites, forums, SERPs … and it applies to aggregated syndicated content too.

I’ve asked Google whether they have a particular recommendation, but no answer yet. Here is my ‘fix’:

Define a straight path through the dynamic content, where no displayed entry overlaps with another page. For example, if your default value for items per page is 10, the straight path would be:
start=1&items=10
start=11&items=10
start=21&items=10

Then check the query string before you output the page. If it is part of the straight path, put an INDEX,FOLLOW robots meta tag on the page; otherwise (e.g. start=16&items=15) put NOINDEX.
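
Here is a minimal sketch of that check (Python for illustration; the parameter names start and items follow the examples above, and the default page size of 10 is an assumption to adapt):

    # Sketch of the straight-path check described above. Assumes query
    # parameters named 'start' and 'items' as in the examples; adapt the
    # names, the default page size and the output mechanism to your script.
    DEFAULT_ITEMS = 10

    def robots_meta(start: int, items: int) -> str:
        """Return the robots meta tag for a paginated listing URL.

        A URL is on the 'straight path' when it uses the default page
        size and its offset falls exactly on a page boundary, i.e.
        start = 1, 11, 21, ... for items = 10. Everything else overlaps
        another page and gets NOINDEX.
        """
        on_straight_path = (items == DEFAULT_ITEMS
                            and (start - 1) % DEFAULT_ITEMS == 0)
        if on_straight_path:
            return '<meta name="robots" content="INDEX,FOLLOW">'
        return '<meta name="robots" content="NOINDEX">'

    print(robots_meta(start=21, items=10))  # INDEX,FOLLOW
    print(robots_meta(start=16, items=15))  # NOINDEX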

I don’t know whether this method can help with shops using descriptions pulled from a vendor’s data feed, but I doubt it. If Google can determine and suppress partial dupes within a site, it can do that with text snippets from other sites too. One question remains: how does Google identify the source?


Information is Temporarily Unavailable

I’ve added my favorite feeds to my personalized Google home page. Unfortunately, “Information is temporarily unavailable” is the most repeated sentence on this page. It takes up to 20 minutes before all feeds are fetched and shown with current headlines. That’s weird, because Googlebot pulls known feeds every 15 minutes, often a few times per second.

It seems to me that Google does not cache (all, if any) RSS feeds. When I refresh the page while watching my crawler monitor screen, I can see Googlebot quickly fetching the ‘unavailable’ feeds from my server. After a while they get updated on the page.

Some feeds can't be added. For example, Google's news feed 'http://news.google.com/news?hl=en&ned=us&q=Google+Sitemaps&ie=UTF-8&output=rss&scoring=d' doesn't show up, not even after dozens of clicks on 'Go' during the last week. The same happens every once in a while with MyYahoo too, by the way. Microsoft's sandbox even likes to crash my browser, which I consider the usual behavior of MS products, sandboxed or not.


Matt Cutts Slashdotted?

Matt Cutts starting to blog is great news. The bad news is that, shortly after the first announcement, the blogosphere and search-related sites seem to generate so much traffic that his blog is currently unreachable. If Google is not willing to host his blog, I'd like to be the first one to donate toward suitable hosting - Matt Cutts' advice is worth a reasonable donation.

[Update] No slashdotting involved. Matt posts:

The site was down for a few hours today. I had visions of hordes of overenthusiastic SEOs, but my webhost said it was nothing to do with my site specifically–they said their server crashed “catastrophically.” I wanted to ask if they were sure the server wasn’t allergic to me, but they seemed rather busy.


Yahoo’s Site Explorer

There is a lot of interesting reading in the SES coverage by Search Engine Roundtable. Currently I'm a little sitemap-addicted, so Tim Mayer's announcement got my attention:

Tim announces a new product named Site Explorer, where you can get your linkage data. It is a place for people to go to see which pages Yahoo indexed and to let Yahoo know about URLs Yahoo has not found as of yet …. He showed an example, you basically type in a URL into it (this is also supported via an API…), then you hit explore URL and it spits out the number of pages found in Yahoo’s index and also shows you the number of inbound links. You can sort pages by “depth” (how deep pages are buried) and you can also submit URLs here. You can also quickly export the results to TSV format.

Sounds like a pretty comfortable tool for manual submissions, harvesting data for link development, etc. Unfortunately it's not live yet; I'd love to read more about the API. The concept outlined above makes me think that I may get an opportunity to shove my fresh content into Yahoo's index way faster than today, because in comparison to other crawlers Yahoo! Slurp is a little lethargic:

Crawler stats (tiny site):

Crawler        Page Fetches   robots.txt Fetches   Data Volume   Last Access
Googlebot      7755           30                   73.34 MB      11 Aug 2005 - 00:03
MSNBot         1627           98                   39.86 MB      10 Aug 2005 - 23:38
Yahoo! Slurp   385            204                  13.61 MB      10 Aug 2005 - 23:53

I may be misled here, but Yahoo's Site Explorer announcement could indicate that Yahoo will not implement Google's Sitemap Protocol. That would be a shame.

Tim Mayer in another SES session:
Q: “Is there a way to do the Google sitemaps type system at Yahoo?”
Tim: We just launched the feed to be able to do that. We will be expanding the products into the future.


Good News from Google

Google is always good for some news: since yesterday, news queries are available as RSS feeds. That's good news, although Google shoves outdated HTML (font tags and the like) into the item description. It's good practice to separate content from presentation, and hard-coded background colors combined with foreign CSS can screw up a page, so webmasters must extract the text content if they want to make use of Google's news feeds.
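
Extracting the text is simple enough; a minimal sketch in Python using only the standard library (the sample description markup is made up):

    # Strip presentational HTML (font tags and friends) from a feed
    # item's description before embedding it in your own page.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects only the text content, dropping all tags and attributes."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)
        def text(self):
            return ''.join(self.chunks)

    description = '<font size="-1" color="#6f6f6f">Some news snippet …</font>'
    extractor = TextExtractor()
    extractor.feed(description)
    print(extractor.text())  # Some news snippet …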

As for Google and RSS: to curb Ms. Googlebot's greed for harvested feeds, Google needs to install a ping service. Currently Ms. Googlebot requests feeds way too often, because she spiders them based on guesses and time schedules (one or more fetches every 15 minutes). From my wish list: http://feeds.google.com/ping?feedURI usable for submissions and pings on updates.
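
If that service existed, notifying it on every feed update would be trivial; a sketch (the endpoint is purely my wish-list example and does not exist):

    # Hypothetical: ping the wished-for Google feed ping service on update.
    # http://feeds.google.com/ping does NOT exist; it is the wish-list URL
    # from above, shown only to illustrate how cheap the mechanism would be.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def ping_feed_service(feed_uri: str) -> int:
        query = urlencode({'feedURI': feed_uri})
        with urlopen('http://feeds.google.com/ping?' + query) as response:
            return response.status  # 200 would mean the ping was accepted

    # ping_feed_service('http://www.example.com/blog/rss.xml')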

Google already makes use of ping technology in the sitemap program, so a ping server shouldn’t be a big issue. Apropos sitemaps: the Google Sitemaps team has launched Inside Google Sitemaps. While I’m on Google bashing, here is a quote from the welcome post (tip: a prominent home link on every page wouldn’t hurt, especially since the title is linked to google.com instead of the blog):

When you submit your Sitemap, you help us learn more about the contents of your site. Participation in this program will not affect your pages’ rankings or cause your pages to be removed from our index.

That's not always true. A Googlebot discovering a whole site will find a lot of stuff that is relevant for rankings, for example the anchor text of internal links on formerly unknown pages, and this may improve a site's overall search engine visibility. On the other hand, sitemap-based junk submissions can easily tank a site on the SERPs.

Last but not least Google has improved its wildcard search and can tell us now what SEO is all about *. Compare the search result to Google’s official SEO page and wonder.


The Power of Search Related Blogs

Aaron Wall posted a great call to help stop the war in northern Uganda. He argues that if search blogs can get the #1 spot for a missing member of the community, the same can be done to draw attention to a war where children are abused as cannon fodder. Visit Uganda Conflict Action Network for more information, because you won't find it at CNN or elsewhere.

Aaron’s call for action:

If you do not like the idea of children being abducted, murdered, and living in constant fear please help. A few options:


Google Has Something Cooking

Covered by a smoke screen, Google performs an internal and top secret summer of coding. Charlie Ayers, Google's famous chef, has decided to leave the GooglePlex. Larry & Sergey say that hungry engineers work harder on algo tweaks and celestial projects as well. While SEOs and webmasters are speculating on the strange behavior of the saucy Googlebot sisters, haggard engineers are cooking their secret sauce in the labs. Under those circumstances some collateral damage is preprogrammed, but hungry engineers don't care about a few stinkin' directories they blow away by accident. Shit happens, don't worry, failure is automated by Google. Seriously, wait for some exciting improvements in the googlesphere.


Fresh Content is King

Old news:
A bunch of unique content and a high update frequency increases search engine traffic.
Quite new:
Leading crawlers to fresh content becomes super important.
Future news:
Dynamic Web sites optimized to ping SE crawlers outrank established sites across the board.

Established methods and tools to support search engine crawlers are clever internal linkage, sitemap networks, 'What's new' pages, inbound links from high-ranking and frequently changed pages, etc. To a limited degree they still lead crawlers to fresh content and to not-yet-spidered old content. Time to crawl and time to index are unsatisfying, because the whole system is based on pulling and depends on the search engine backend's ability to guess.

Look at Google: Google News, Froogle, Sitemaps and rumors about blog search indicate a change from progressive pulling of mass data to proactive, event-driven picking of fewer but fresher data. Google will never stop crawling based on guessing, but it has learned how to localize fresh content in no time by making use of submissions and pings.

Blog search engines more or less perfectly fulfill the demand for fresh, popular content. The blogosphere pings blog search engines, which is why they are so up to date. The blogosphere is huge and the number of blog posts is enormous, but it is just a tiny part of the Web. Even more fresh content is published elsewhere, and elsewhere is the playground of the major search engines, not even touched by blog search engines.

Google wants to dominate search, and currently it does. Google cannot ignore the demand for fresh and popular content, and Google cannot lower the relevancy of its search results. Will Google's future search results be ranked by some sort of 'recent relevancy' algo? I guess not in general, but 'recent relevancy' is not an oxymoron, because Google can learn to determine the type of the requested information and deliver more recent or more relevant results depending on the query context and tracked user behavior. I'm speculating here, but it is plausible, and Google has already developed all the components necessary to assemble such an algo.

Based on the speculation above, investments in RSS technology and the like should be a wise business decision. If 'ranking by recent relevancy' or something similar comes true, dynamic Web sites with the bigger toolset will often outrank the established but more statically organized sources of information.


Bait Googlebot With RSS Feeds

Seeing Ms. Googlebot’s sister running wild on RSS feeds, I’m going to assume that RSS feeds may become a valuable tool to support Google’s fresh and deep crawls. Test it for yourself:

Create an RSS feed with a few unlinked or seldom-spidered pages which are not included in your XML sitemap. Add the feed to your personalized Google home page ('Add Content' -> 'Create Section' -> enter the feed URL). Track spider accesses to the feed and to the included pages as well. Most probably Googlebot will request your feed more often than Yahoo's FeedSeeker and similar bots do. Chances are that Googlebot-Mozilla is nosy enough to crawl at least some of the pages linked in the feed.
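
To track those spider accesses, something like this minimal Python sketch over an Apache combined access log will do (the feed path, log file name and user-agent substrings are assumptions to adapt):

    # Count bot fetches of a feed URL in an Apache combined access log.
    # FEED_PATH, the log file name and the user-agent substrings are
    # assumptions; adjust them to your own setup.
    from collections import Counter

    FEED_PATH = '/rss.xml'
    BOTS = ('Googlebot', 'Yahoo! Slurp', 'msnbot')

    hits = Counter()
    with open('access.log') as log:
        for line in log:
            if FEED_PATH not in line:
                continue
            for bot in BOTS:
                if bot in line:
                    hits[bot] += 1

    for bot, count in hits.most_common():
        print(f'{bot}: {count} feed fetches')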

That does not help a lot with regard to indexing and ranking, but it seems to be a neat procedure to help the Googlebot sisters spot fresh content. In real life, add the pages to your XML sitemap, link to them and acquire inbound links…

To test the waters, I've added RSS generation to my Simple Google Sitemaps Generator. This tool reads a plain page list from a text file, and generates a dynamic XML sitemap, an RSS 2.0 site feed and a hierarchical HTML site map.
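
The RSS part boils down to very little; here is a sketch of the idea (not my actual tool - the file name and the channel metadata are made-up placeholders):

    # Read page URLs from a plain text file (one per line) and emit a
    # minimal RSS 2.0 feed. The file name 'pages.txt' and the channel
    # title/link/description are made-up placeholders.
    from email.utils import formatdate
    from xml.sax.saxutils import escape

    with open('pages.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    items = '\n'.join(
        f'  <item><title>{escape(url)}</title><link>{escape(url)}</link></item>'
        for url in urls
    )

    feed = f'''<?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
     <channel>
      <title>Site feed</title>
      <link>http://www.example.com/</link>
      <description>Fresh and seldom-crawled pages</description>
      <lastBuildDate>{formatdate()}</lastBuildDate>
    {items}
     </channel>
    </rss>'''

    print(feed)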

Related article on Google’s RSS endeavors: Why Google is an RSS laggard


Take Free SEO Advice With a Grain of Salt

Yesterday I spotted again how jokes become rumors. It happens every day, and sometimes it even hurts. In my post The Top-5 Methods to Attract Search Engine Spiders I was joking about bold dollar signs driving the MSN bot crazy. A few days later I discovered the first Web site making use of the putative '$$ trick'. To make a sad story worse, the webmaster had put in the dollar signs as hidden text.

This reminds me of the spread of the ghostly robots revisit META tag. This tag was used by a small regional Canadian engine for local indexing in the stone age of the Internet. Today every free META tag generator on the net produces a robots revisit tag. Not a single search engine is interested in this tag. It was never standardized. But it's present on billions of Web pages.

This is how bad advice becomes popular. Folks read nasty tips and tricks on the net and don't apply common sense when they implement them. There is no such thing as free and good advice on the net. Even good advice on a particular topic can produce astonishing effects when applied outside its context. It's impossible to learn SEO from free articles and posts on message boards. Go see an SEO - it's worth it.

