Google’s Anchor Text Reports Encourage Spamming and Scraping

Despite the title bait, Google's new Webmaster Toy is an interesting tool, thanks! I'll look at it from a Webmaster's perspective, all SEO hats in the wardrobe.

Danny tells us more, for example the data source:

A few more details about the anchor text data. First, it comes only from external links to your site. Anchor text you use on your own site isn’t counted.
Second, links to any subdomains you have are NOT included in the data.

From a quick look I doubted that Google filters internal links properly. So I checked a few sites and found internal anchor text leading the new anchor text stats. Well, it's not unusual for folks to copy and paste links including the anchor text, and using page titles as anchor text is business as usual. But why the heck do page titles and shortcuts from vertical menu bars appear as the most prominent external anchor text? With large sites having tons of inbound links, that's not so easy to investigate.

Thus I looked at a tiny site of mine, this blog. It carries a few recent bollocks posts which were so useless that nobody bothered linking, or so I thought. Well, that's partly the case: there were no links by humans, but there were enough fully automated links to get Ms. Googlebot's attention. Thanks to scrapers, RSS aggregators, forums linking an author's recent blog posts and so on, everything on this planet gets linked, so duplicated post titles make it into the anchor text stats.

Back at the larger sites, I found that scraper sites and indexed search results were responsible for the extremely misleading ordering of Google's anchor text excerpts. Neither page type should get indexed in the first place, and it's ridiculous that crappy, irrelevantly accumulated data dilutes a well-meant effort to report inbound linkage to Webmasters.

Ordering inbound anchor text by sheer frequency, unweighted, and limiting the list to 100 phrases makes this report utterly useless for many sites, and, so much the worse, it sends a strong but wrong signal. In the eye of the beholder, whatever sits at the top of an ordered list or result set looks important.

Joe Webmaster, tracking down the sources of his top-10 inbound links, finds a shitload of low-life pages, thinks hey, linkage is after all a game of large numbers when Google says those links are most important, launches a gazillion doorway pages and joins every link farm out there. Bummer. Next week Joe Webmaster pops up in the Google Forum telling the world that his innocent site got tanked because he followed Google's suggestions, and we're bothered with just another huge but useless thread discussing whether scraper links can damage rankings or not, finally inventing the "buffy blonde girl pointy stick penalty" on page 128.

Roundtrip to the hat rack, SEO hat attached. I perfectly understand that Google is not keen on disclosing trust/quality/reputation scores. I can read these stats because I understand their anatomy (intention, data, methods, context), and perhaps I can even get something useful out of them. Explaining this to an impatient site owner who reluctantly bought the link-quality-counts argument but still believes that nofollow'ed links carry some mysterious weight because they appear in reversed citation results is a completely different story.

Dear Google, if you really can’t hand out information on link weighting by ordering inbound anchor text by trust or other signs of importance, then can you please at least filter out all the useless crap? This shouldn’t be that hard to accomplish, since even simple site-search queries for “scraping” reveal tons of definitely not index-worthy pages which already do not pass any search engine love to the link destinations. Thanks in advance!

When I write “Dear Google”, can Vanessa hear me? Please :)


Update: Check your anchor text stats page every now and then, don’t miss out on updates :)




Spamming wannabe SEOs: SEO Affiliates, Cleveland, Ohio

Got this email from a site owner today:

I can put your site at the top of a search engines listings. This is no joke and I can show proven results from all our past clients.
If this is something you might be interested in, send me a reply with the web addresses you want to promote and the best way to contact you with some options.

Thanks in advance,

Sarah Lohman
SEO Affiliates
124 Middle Ave.
Cleveland, OH 44035

I told him long ago that he should dump that site if its affiliate earnings from MSN and Yahoo traffic drop too much, and now some spamming assclowns from Cleveland, Ohio guarantee a #1 listing for crap, sheesh. Here’s the polite reply:

I'm sure you're joking, coz being such an experienced SEO you should have noticed that this site has suffered from Google's death penalty, for various good reasons, since 2002.
Go figure, spammer.

Sigh. Off to submit a ton of my email addresses to the spammer’s idiot collector.





Why eCommerce Systems Suck at SEO

Listening to whiners and disappointed site owners across the boards, I guess in a few weeks we'll be discussing Google's brand new e-commerce penalties in flavors like -30, -900 and -supphell. NOT! A recent algo tweak may have figured out how to identify more crap, but I doubt Google has launched an anti-eCommerce campaign.

You don't need more than an award-winning mid-range e-commerce shopping cart like Erol to earn the Google death penalty. Thanks to this award-winning software, sold as "search engine friendly" on its home page, or rather thanks to its crappy architecture (sneaky JS redirects as per Google's Webmaster guidelines), many innocent shopping sites from Erol's client list have vanished or will be deindexed soon. Unbelievable when you read more about their so-called SEO Services. Oh well, so much for an actual example. The following comments do not address Erol shopping carts in particular, but e-commerce systems in general.

My usual question when asked to optimize eCommerce sites is "are you willing to dump everything except the core shopping cart module?". Unfortunately, that's the best as well as the cheapest solution in most cases. The technical crux with eCommerce software is that it's developed by programmers, not Web developers, and software shops don't bother asking for SEO advice. The result is often fancy crap.

Another common problem is that the UI is optimized for shoppers (a subclass of 'surfers', which search engine crawlers emulate reasonably well). Navigation is mostly shortcut- and search-driven (POST-generated results aren't crawlable) and relies on variables stored in cookies and wherever else (invisible to spiders). All the navigational goodies which make up the surfing experience are implemented with client-sided technologies, or, if put server-sided, served via ugly URLs with nasty session IDs (ignored by crawlers, or at least heavily downranked for various reasons). What's left for the engines? Deep hierarchical structures of thin pages plastered with duplicated text and buy-now links. That's not the sort of spider food Ms. Googlebot and her colleagues love to eat.

Guess why Google doesn't crawl search results: because search results are inedible spider fodder not worth indexing. The same goes for badly linked conglomerates of thin product pages. Think of a different approach. Instead of trying to shove thin product pages into search indexes, write informative pages on product lines/groups/… and link to the product pages within the text. When these well-linked info pages provide enough product detail, they'll rank for product-related search queries, and you'll generate linkworthy content. Don't forget to disallow /shop, /search and /products in your robots.txt.
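Here's a minimal robots.txt sketch along those lines. The paths are just the examples from the sentence above; swap in whatever your shopping cart actually uses:

User-agent: *
Disallow: /shop
Disallow: /search
Disallow: /products

The informative pages you do want indexed live outside these paths, so crawlers can still reach them.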

Disclaimer: I've checked essentialaids.com; Erol's software does JavaScript redirects that obfuscate the linked URLs in order to deliver the content client-sided. I've followed this case over a few days, watching Google deindex the whole site page by page. This kind of redirect is considered "sneaky" by Google, and Google's spam filters detect it automatically. Although there is no bad intent, Google bans all sites using this technique. Since this is a key feature of the software, how can they advertise it as "search engine friendly"?

From their testimonials (most are affiliates) I've looked at irishmusicmail.com and found that Google has indexed only 250 pages from well over 800; it looks like the Erol shopping system was removed. The other non-affiliated testimonial is from heroesforkids.co.uk, a badly framed site which is also not viewable without JavaScript; due to SE-unfriendliness Google has indexed only 50 out of 190 pages (and deindexed the site a few days later). Another reference, brambleandwillow.com, didn't load at all; Google has no references, but I found Erol-styled URLs in Yahoo's index. Next, pensdirect.co.uk suffers from the same flawed architecture as heroesforkids.co.uk, although the pages/indexed-URLs ratio is slightly better (15 of 40+). From a quick look at the Erol JS source, all these pages will get removed from Google's search index. I didn't write this to slander Erol and its inventor Dreamteam UK, although these guys would deserve it. It's just a warning that good-looking software which might perfectly support all related business processes can be extremely destructive from an SEO perspective.

Update: It's probably possible to make Erol-driven shops compliant with Google's quality guidelines by creating the pages without a software feature called "page loading messages". More information is provided in several posts in Erol's support forums.





Q: Does Googlebot obey ftp://ftp.example.com/robots.txt?

Google states

Search engine robots, including our very own Googlebot, are incredibly polite. They work hard to respect your every wish regarding what pages they should and should not crawl.

A site owner posted logs where Googlebot fetches files although these are disallowed in robots.txt.

Anyone?




Google Webmasters Help FAQ launched

John Web has launched the Inofficial Google Webmaster FAQs; stop by and contribute your knowledge.

This FAQ was born in a Google Groups thread where regular posters discussed the handling of redundant questions in Google’s Webmaster Help Center.





Blogger.com Hates Digg

Ever tried to add a digg button to a blogspot blog? You get a thoughtless error message like “HTML tag <script type=’text/javascript’> is not allowed”. Since when is JavaScript code an HTML tag? BS. <polemic mode>Why does Blogger.com allow JavaScript code made by AdSense?

Conclusion: Blogger.com hates digg.com but loves Google’s AdSense.

Did I hear Googlers encouraging link building tactics like link baiting and social bookmarking? Perhaps I just misunderstood those messages … </polemic mode>

Here is the code downtrodden by Google:

<script type="text/javascript">digg_url = 'http://sebastianx.blogspot.com/2007/02/why-proper-error-handling-is-important.html';</script><script src="http://digg.com/tools/diggthis.js" type="text/javascript"></script>


Ok, I admit that Google censors, err, filters blog posts in the best interest of noobs who don't know what a piece of JS code can achieve. However, I'm not an ignorant noob, and I do want to make use of JavaScript in my blog posts.

Hey Google, please listen to my rant! Thanks.

PS: Dear Google, please don’t tell me that I can use JS within the template. I don’t think every post of mine is worth a digg button and I’m too lazy to code the conditions.




Does Adam Lasnik like Rel=Nofollow or not?

Spotting the headline "Google's Lasnik Wishes 'NoFollow Didn't Exist'", I was quite astonished. My first thought was "logic can't explain such a reversal". It turned out to be kind of a blog hoax.

Adam’s “I wish nofollow didn’t exist” put back in context clarifies Google’s position:

“My core point […] was that it’d be really nice if nofollow wasn’t necessary. As it stands, it’s an admittedly imperfect yet important indicator that helps maintain the quality of the Web for users.

It’d be nice if there was less confusion about what nofollow does and when it’s useful. It’d be great if we could return to a more innocent time when practically all links to other sites really WERE true votes, folks clearly vouching for a site on behalf of their users.

But we don’t live in perfect, innocent times, and at Google we’re dedicated to doing what it takes to improve the signal-to-noise ratio in search quality.”

I like the “admittedly imperfect” piece ;)





Google pulls CIA data

Bollocks devoted to Daniel Brandt

Playing with Google's 71 new search keywords, I noticed that many search queries get answered with CIA data. Try "national holiday" Canada, Germany, France, Italy and so on; in all cases you get directed to the CIA factbook.




How Google’s Web Spam Team finds your link scheme

Natural Search Blog has a nice piece reporting that Matt’s team makes use of a proprietary tool to identify webspam trying to manipulate Google’s PageRank.

Ever wondered why Google catches the scams of PR-boosting services in no time?




Why proper error handling is important

Misconfigured servers can prevent search engines from crawling and indexing. I admit that's yesterday's news. However, standard setups and code copied from low-quality resources are underestimated, but very popular, points of failure. According to Google, a missing robots.txt file in combination with amateurish error handling can result in invisibility on Google's SERPs. And that's a very common setup, by the way.

Googler Jonathon Simon said:

This way [correct setup] when the Google crawler or other search engine checks for a robots.txt file, they get a 200 response if the file is found and a 404 response if it is not found. If they get a 200 response for both cases then it is ambiguous if your site has blocked search engines or not, reducing the likelihood your site will be fully crawled and indexed.

That's a very carefully written warning, so I'll try to rephrase the message between the lines:

If you have no robots.txt and your server responds "Ok" (or with a 302 on the request for robots.txt followed by a 200 response on the request for the error page) when Googlebot tries to fetch it, Googlebot might not be willing to crawl your stuff further, hence your pages will not make it into Google's search index.

If you don't suffer from IIS (Windows hosting is a horrible nightmare coming with more pitfalls than there are countable objects in the universe: go find a reliable host), here is a bullet-proof setup.

If you don’t have a robots.txt file yet, create one and upload it today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file-types or areas of your site, refer to the robots.txt manual.
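Just to illustrate the exclusion syntax: a sketch that keeps crawlers out of a script directory and an internal search area might look like this (/cgi-bin/ and /search are made-up placeholder paths, not a recommendation for your site):

User-agent: *
Disallow: /cgi-bin/
Disallow: /search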

Next, look at the .htaccess file in your server's Web root directory. If your FTP client doesn't show it, add "-a" to "external mask" in the settings and reconnect. If you find complete URLs in lines starting with "ErrorDocument", your error handling is screwed up. What happens is that your server does a soft redirect to the given URL, which probably responds with "200 Ok", and the actual error code gets lost in cyberspace. Sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM XT hosting Google.com, and all other error directives pointing to absolute URLs result in crap. Here is a well-formed .htaccess sample:

ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
Options -Indexes
<Files ".ht*">
deny from all
</Files>
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]

With "ErrorDocument" directives you can capture other clumsiness as well, for example 500 errors with /server-too-buzzy.html or so. Or make the error handling more comfortable using /error.php?errno=[insert err#]. In any case, avoid relative URLs (src attributes of IMG elements, CSS/feed links, href attributes of A elements …) on all error pages. You can test the actual HTTP response codes with online header checkers.
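To illustrate, here's a sketch of such additional directives, reusing the file names from the paragraph above (the 503 line is just one more example of the same idea):

ErrorDocument 500 /server-too-buzzy.html
ErrorDocument 503 /server-too-buzzy.html

# or route everything to one script and branch on the error number:
# ErrorDocument 404 /error.php?errno=404
# ErrorDocument 500 /error.php?errno=500

If you go the error.php route, double-check with a header checker that the script still returns the original error code and not a 200.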

The other statements in the well-formed sample do different neat things. Options -Indexes disallows directory browsing, the <Files> block makes sure that nobody can read your server directives, and the last three lines redirect requests for invalid server names to your canonical server address.

.htaccess is a plain ASCII file; it can get screwed up when you upload it in binary mode or when you change it with a word processor. Best edit it with an ASCII/ANSI editor (vi, Notepad) as htaccess.txt on your local machine (most FTP clients choose ASCII mode for text files) and rename it to ".htaccess" on the server. Keep in mind that file names are case-sensitive.




