Overlooked Duplicated Content Vanishing from Google’s Index

Does Google systematically wipe out duplicated content? If so, does it affect partial dupes too? Will Google apply site-wide ‘scraper penalties’ when a particular dupe-threshold gets reached or exceeded?

Following many ‘vanished page’ posts with links on message boards and usenet groups, and monitoring sites I control, I’ve found that there is indeed a kind of pattern. It seems that Google is actively wiping dupes out. Those pages get deleted or stay indexed as ‘URL only’; they are not moved to the supplemental index.

Example: I have a script listing all sorts of widgets pulled from a database, where users can choose how many items they want to see per page (the values for the number of widgets per page are hard coded and all linked), combined with prev|next-page links. This kind of dynamic navigation produces tons of partial dupes (content overlaps with other versions of the same page). Google has indexed way too many permutations of that poorly coded page, and foolishly I didn’t take care of it. Recently I was alerted when Googlebot-Mozilla requested hundreds of versions of this page within a few hours. I quickly changed the script to put a robots NOINDEX meta tag on the page when the content overlaps, but probably too late. Many of the formerly indexed URLs (cached, appearing with title and snippets on the SERPs) have vanished or become URL-only listings. I expect that I’ll lose a lot of ‘unique’ listings too, because I changed the script in the middle of the crawl.

I’m posting this before I have solid data to back up the finding, because it is a pretty common scenario. This kind of navigation is used at online shops, article sites, forums, SERPs … and it applies to aggregated syndicated content too.

I’ve asked Google whether they have a particular recommendation, but no answer yet. Here is my ‘fix’:

Define a straight path thru the dynamic content, where not a single displayed entry overlaps with another page. For example if your default value for items per page is 10, the straight path would be:
start=1&items=10
start=11&items=10
start=21&items=10

Then check the query string before you output the page. If it is part of the straight path, put an INDEX,FOLLOW robots meta tag, otherwise (e.g. start=16&items=15) put NOINDEX.
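The check can be sketched in a few lines. This is only a minimal Python illustration, not the script I actually run: the function name robots_meta and the constant names are made up for this example, the parameter names start and items are taken from the URLs above, and I assume that a missing parameter falls back to its default (start=1, items=10) and that anything unparsable gets NOINDEX to be on the safe side.

```python
from urllib.parse import parse_qs

# Assumed default page size, matching the example path above.
DEFAULT_ITEMS = 10

INDEX_TAG = '<meta name="robots" content="INDEX,FOLLOW">'
NOINDEX_TAG = '<meta name="robots" content="NOINDEX">'


def robots_meta(query_string: str, default_items: int = DEFAULT_ITEMS) -> str:
    """Return the robots meta tag for one request of the listing page.

    A request is on the straight path when it uses the default page size
    and its start offset is 1, 1 + default_items, 1 + 2 * default_items, ...
    """
    params = parse_qs(query_string)
    try:
        # Missing parameters fall back to the defaults (an assumption).
        start = int(params.get("start", ["1"])[0])
        items = int(params.get("items", [str(default_items)])[0])
    except ValueError:
        # Garbage in the query string: keep the page out of the index.
        return NOINDEX_TAG

    on_straight_path = (
        items == default_items
        and start >= 1
        and (start - 1) % default_items == 0
    )
    return INDEX_TAG if on_straight_path else NOINDEX_TAG
```

With this sketch, start=21&items=10 gets the INDEX,FOLLOW tag, while start=16&items=15 and start=5&items=10 get NOINDEX, because both overlap with pages on the straight path.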

I don’t know whether this method can help with shops using descriptions pulled from a vendor’s data feed, but I doubt it. If Google can determine and suppress partial dupes within a site, it can do that with text snippets from other sites too. One question remains: how does Google identify the source?


3 Comments to "Overlooked Duplicated Content Vanishing from Google's Index"

  1. John on 27 August, 2005

    Where does actively sabotaging a site by scraping, publishing and linking a duplicate site (before the original e.g. goes with Google Sitemaps) fit in? If duplicate content is bad, getting original content marked as duplicate is worse. Any idea on how much of this is already going on? :-)

  2. Sebastian on 28 August, 2005

    Nope, I’ve no stats on this.

    Getting your stuff well linked at the time of release and indexed first is the best prevention. It’s hard to prevent damage like this when a site providing unique content is overall low ranked in terms of linkpop and authority, or lacks a strong theme.

  3. biz on 11 July, 2009

    What is the risk of publishing articles on isnare and ezine articles?

    [Dilution; you really should try to create magnetic content focusing on your core expertise at your place. That’ll attract natural inbound links that count for search engine rankings. Reprints might outrank you whilst the links you spread this way get devalued by the engines and/or condomized by the MFA sites that use your submitted articles. You gain more from publishing your intellectual property yourself. If you really want to share your stuff, better make use of RSS feeds licensed under a Creative Commons license.]

    Should I stop publishing them?

    [Yep.]
