Archived posts from the 'Search Quality' Category

OMFG - Google sends porn punters to my website …

In todays GWC doctor’s office, the webmaster of an innocent orphanage website asks Google’s Matt Cutts:

[My site] is showing up for searches on ‘girls in bathrooms’ because they have an article about renovating the girls bathroom! What do you think of the idea if a negative keyword meta tag to block irrelevant searches? [sic!]

Well, we don’t know what the friendly guy from Google recommends …

… but my dear readers do know that my bullshit detector, faced with such a moronic idea, shouts out in agony:

There’s no such thing as bad traffic, just weak monetizing!

Ok, Ok, Ok … every now and then each and every webmaster out there suffers from misleaded search engine ranking algos, that send shitloads of totally unrelated search traffic. For example, when you search for [how to fuck a click], you won’t expect that Google considers this geeky pamphlet the very best search result. Of course Google should’ve detected your NSFW-typo. Shit happens. Deal with it.

On the other hand, search traffic is free, so there’s no valid reason to complain. Instead of asking Google for a minus-keyword REP directive, one should think of clever ways to monetize unrelated traffic without wasting bandwidth.

You want to monetize irrelevant traffic from searches for smut in a way that nobody can associate your site with porn. That’s doable. Here’s how it works:

Make risk-free beer money from porn traffic with a non-adult site

Copy those slimy phrases from your keyword stats and paste them into Google’s search box. Once you find an adult site that seems to match the smut surfer’s needs better than your site, click on the search result, and on the landing page search for a “webmasters” link that points to their affiliate program. Sign up and save your customized affiliate link.

Next add some PHP code to your scripts. Make absolutely sure it gets executed before you output any other content, even whitespace:

<?php  Show all code

$betterMatch = getOffsiteUri();
if ($betterMatch) {
header("HTTP/1.1 307 Here's your smut", TRUE, 307);
header("Location: $betterMatch");
exit;
}
?>
Refine the simplified code above. Use a database table to store the mappings …

Now a surfer coming from a SERP like
http://google.com/search?num=100&q=nude+teens+in+bathroom&safe=off

will get redirected to
http://someteenpornsite.com/landingpage?affID=4711

You’re using a 307 redirect because it’s not cached by a user agent, so that when you later on find a porn site that converts your traffic better, you can redirect visitors to another URI.

As you probably know, search engines don’t approve duplicate content. Hence it wouldn’t be a bright idea to put up x-rated stuff (all smut is duplicate content by design) onto your site to fulfil the misleaded searcher’s needs.

Of course you can use the technique outlined above to protect searchers from landing on your contact/privacy page, too, when in fact your signup page is their desired destination.

Shiny whitehat disclaimer

If you’re afraid of the possibility that the allmighty Google might punish you for your well meant attempt to fix it’s bugs, relax.

A search engine misinterpreting your content so badly, failed miserably. Your bugfix actually improves their search quality. Search engines can’t force you to report such flaws, they just kindly ask for voluntary feedback.

If search engines dislike smart websites that find related content on the Interwebs in case the search engine delivers shitty search results, they can act themselves. Instead of penalizing webmasters that react to flaws in their algos, they’re well advised to adjust their scoring. I mean, if they stop sending smut traffic to non-porn sites, their users don’t get redirected any longer. It’s that simple.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Cloaking is good for you. Just ignore Bing’s/Google’s guidelines.

Summary first: If you feel the need to cloak, just do it within reason. Don’t cloak because you can, but because it’s technically the most elegant procedure to accomplish a Web development task. Bing and Google can’t detect your (in no way deceptive) intend algorithmically. Don’t spam away, though, because you might leave trails besides cloaking alone, if you aren’t good enough at spamming search engines. Keep your users interests in mind. Don’t comply to search engine guidelines as set in stone, but to a reasonable level, for example when those force you to comply to Web standards that make more sense than the fancy idea you’ve developed on internationalization, based on detecting browser language settings or so.

search engine guidelines are bullshit WRT cloakingThis pamphlet is an opinion piece. The above said should be considered best practice, even by search engines. Of course it’s not, because search engines can and do fail, just like a webmaster who takes my statement “go cloak away if it makes sense” as technical advice and gets his search engine visibility tanked the hard way.

WTF is cloaking?

Cloaking, also known as IP delivery, means delivering content tailored for specific users who are identified primarily by their IP addresses, but also by user agent (browser, crawler, screen reader…) names, and whatnot. Here’s a simple demonstration of this technique. The content of the next paragraph differs depending on the user requesting this page. Googlebot, Googlers, as well as Matt Cutts at work, will read a personalized message:

Dear visitor, thanks for your visit from 38.107.191.88 (38.107.191.88).

You surely can imagine that cloaking opens a can of worms lots of opportunities to enhance a user’s surfing experience, besides “stalking” particular users like Google’s head of WebSpam.

Why do search engines dislike cloaking?

Apparently they don’t. They use IP delivery themselves. When you’re traveling in europe, you’ll get hints like “go to Google.fr” or “go to Google.at” all the time. That’s google.com checking where you are, trying to lure you into their regional services.

More seriously, there’s a so-called “dark side of cloaking”. Say you’re a seasoned Internet marketer, then you could show Googlebot an educational page with compelling content under an URI like “/games/poker” with an X-Robots-Tag HTTP header telling “noarchive”, whilst surfers (search engine users) supplying an HTTP_REFERER and not coming from employee.google.com get redirected to poker dot com (simplified example).

That’s hard to detect for Google’s WebSpam team. Because they don’t do evil themselves, they can’t officially operate sneaky bots that use for example AOL as their ISP to compare your spider fodder to pages/redirects served to actual users.

Bing sends out spam bots that request your pages “as a surfer” in order to discover deceptive cloaking. Of course those bots can be identified, so professional spammers serve them their spider fodder. Besides burning the bandwidth of non-cloaking sites, Bing doesn’t accomplish anything useful in terms of search quality.

Because search engines can’t detect cloaking properly, not to speak of a cloaking webmaster’s intentions, they’ve launched webmaster guidelines (FUD) that forbid cloaking at all. All Google/Bing reps tell you that cloaking is an evil black hat tactic that will get your site penalized or even banned. By the way, the same goes for perfectly legit “hidden content” that’s invisible on page load, but viewable after a mouse click on a “learn more” widget/link or so.

Bullshit.

If your competitor makes creative use of IP delivery to enhance their visitors’ surfing experience, you can file a spam report for cloaking and Google/Bing will ban the site eventually. Just because cloaking can be used with deceptive intent. And yes, it works this way. See below.

Actually, those spam reports trigger a review by a human, so maybe your competitor gets away with it. But search engines also use spam reports to develop spam filters that penalize crawled pages totally automatted. Such filters can fail, and –trust me– they do fail often. Once you must optimize your content delivery for particular users or user groups yourself, such a filter could tank your very own stuff by accident. So don’t snitch on your competitors, because tomorrow they’ll return the favor.

Enforcing a “do not cloak” policy is evil

At least Google’s WebSpam team comes with cojones. They’ve even banned their very own help pages for “cloaking“, although those didn’t serve porn to minors searching for SpongeBob images with safe-search=on.

That’s overdrawn, because the help files of any Google product aren’t usable without a search facility. When I click “help” in any Google service like AdWords, I get either blank pages, and/or links within the help system are broken because the destination pages were deindexed for cloaking. Plain evil, and counter productive.

Just because Google’s help software doesn’t show ads and related links to Googlebot, those pages aren’t guilty of deceptive cloaking. Ms Googlebot won’t pull the plastic, so it makes no sense to serve her advertisements. Related links are context sensitive just like ads, so it makes no sense to persist them in Google’s crawling cache, or even in Google’s search index. Also, as a user I really don’t care whether Google has crawled the same heading I see on a help page or not, as long as I get directed to relevant content, that is a paragraph or more that answers my question.

When a search engine doesn’t deliver the very best search results intentionally, just because those pages violate an outdated and utterly useless policy that rules fraudulent tactics in a shape lastly used in the last century and doesn’t take into account how the Internet works today, I’m pissed.

Maybe that’s not bad at all when applied to Google products? Bullshit, again. The same happens to any other website that doesn’t fit Google’s weird idea of “serving the same content to users and crawlers”. I mean, as long as Google’s crawlers come from US IPs only, how can a US based webmaster serve the same content in German language to a user coming from Austria and Googlebot, both requesting a URI like “/shipping-costs?lang=de” that has to be different for each user because shipping a parcel to Germany costs $30.00 and a parcel of the same weight shipped to Vienna costs $40.00? Don’t tell me bothering a user with shipping fees for all regions in CH/AT/DE all on one page is a good idea, when I can reduce the information overflow to a tailored info of just one shipping fee that my user expects to see, followed by a link to a page that lists shipping costs for all european countries, or all countries where at least some folks might speak/understand German.

Back to Google’s ban of its very own help pages that hid AdSense code from Googlebot. Of course Google wants to see what surfers see in order to deliver relevant search results, and that might include advertisements. However, surrounding ads don’t necessarily obfuscate the page’s content. Ads served instead of content do. So when Google wants to detect ad laden thin pages, they need to become smarter. Penalizing pages that don’t show ads to search engine crawlers is a bad idea for a search engine, because not showing ads to crawlers is a good idea, not only bandwidth-wise, for a webmaster.

Managing this dichotomy is the search engine’s job. They shouldn’t expect webmasters to help them solving their very own problems (maintaining search quality). In fact, bothering webmasters with policies solely put because search engine algos are fallible and incapable is plain evil. The same applies to instruments like rel-nofollow (launched to help Google devaluing spammy links but backfiring enormously) or Google’s war on paid links (as if not each and every link on the whole Internet is paid/bartered for, somehow).

What do you think, should search engines ditch their way too restrictive “don’t cloak” policies? Click to vote: Stop search engines that tyrannize webmasters!

 

Update 2010-07-06: Don’t miss out on Danny Sullivan’s “Google be fair!” appeal, posted today: Why Google Should Ban Its Own Help Pages — But Also Shouldn’t



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Ditch the spam on SERPs, pretty please?

Say there’s a search engine that tries very hard to serve relevant results for long tail search queries. Maybe it even accepted that an algo change –supposed to wipe out shitloads of thin pages from its long tail search result pages (SERPs)– is referred to as #MayDay. One should think that this search engine isn’t exactly eager to annoy its users with crappy mash-up pages consisting of shabby stuff scraped from all known sources of duplicate content on the whole InterWebs.

Wrong.

Prominent SE spammers like Mahalo still flood the visible part of search indexes with boatloads of crap that never should be able to cheat its way onto any SERP, not even via a [site:spam.com] search. Learn more from Aaron and Michael, who’ve both invested their valuable time to craft out detailled spam reports, to no avail.

Frustrating.

Wait. Why does a bunch of spammy Web pages creates such a fuss? Because they’re findable in the search index. Of course a search engine must crawl all the WebSpam out there, and its indexer has to judge the value of all the content it gets feeded with. But there’s absolutely no need to bother the query engine, that gathers and ranks the stuff presented on the SERPs, with crap like that.

Dear Google, why do you annoy your users with spam created by “a scheme that your automated system handles quite well” at all? Those awesome spam filters should just flag crappy pages as not-SERP-worthy, so that they can never see the daylight at google.com/search. I mean, why should any searcher be at risk of pulling useless search results from your index? Hopefully not because these misleaded searchers tend to click on lots of Google ads on said pages, right?

I’d rather enjoy an empty SERP for an exotic search query, than suffer from a single link to a useless page plastered with huge ads, even if it comes with a tiny portion of stolen content that might be helpful if pointing to the source.

Do you feel like me? Speak out!

Hey Google, I dislike spam on your SERPs! #spam-report Tweet Your Plea For Clean SERPs!



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Google went belly-up: SERPs sneakily redirect to FPAs

I’m pissed. I do know I shouldn’t blog in rage, but Google redirecting search engine result pages to totally useless InternetExplorer ads just fires up my ranting machine.

What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google’s webmaster guidelines explicitely forbid such black hat tactics:

Don’t use cloaking or sneaky redirects.” Google just did the latter with its very own SERPs. The search interface google.com/ie, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That’s a helpful redirect for some IE6 users who don’t suffer from an IT department stuck with this outdated browser, but it’s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, UA cloaking is the only way to heal this sneaky behavior.

Don’t create pages with malicious behavior.” Google’s guilty, too. Instead of checking for the user’s browser, redirecting only IE6 requests from Google’s discontiued IE6 support (IE6 toolbar …) to the IE8 advertisement, whilst all other user agents get their desired search box, respectively their SERPs, under a google.com/search?output=ie&… URI, Google performs an unconditional redirect to a page that’s utterly useless and also totally unexpected for many searchers. I consider misleading redirects malicious.

Avoid links to web spammers or ‘bad neighborhoods’ on the web.” I consider the propaganda for IE that Google displays instead of the search results I’d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous work arounds. (Ok, ok, ok … Google’s lack of geekiness doesn’t exactly count as violation of their webmaster guidelines, but it sounds good, doesn’t it?)

Hey Matt Cutts, about time to ban google.com/ie! Click to tweet that

Google’s very best search interface is history. Here is what you got under
http://www.google.com/ie?num=100&hl=en&safe=off&q=minimalistic
:

Google's famous minimalistic search UI

And here is where Google sneakily redirects you to when you load the SERP link above (even with Chrome!):
http://www.google.com/toolbar/ie8/sidebar.html
:

Google's sneaky IE8 propaganda

It’s sad that a browser vendor like Google (and yes, Google Chrome is my favorite browser) feels the need to mislead its users with propaganda for a competiting browser that’s slower and doesn’t render everything as it should render it. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that’s a scandal. So, if you agree, please submit a spam report to Google:

Hey Matt Cutts, about time to ban google.com/ie! #spam-report Tweet Your Spam Report

2010-05-17 I’ve updated this pamphlet because it didn’t explain the “sneakiness” clear enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and only stubborn IE6 users ads for a somewhat better browser.

2010-05-18 Q: You’re pissed solely because your SERP scraping scrips broke. A: Glad you’ve asked. Yes, I’ve scraped Google’s /ie search too. Not because I’m a privacy nazi like Daniel Brandt. I’ve just checked (my) rankings. However, when I spotted the redirects I didn’t even remember the location of the scripts that scraped this service, because I didn’t look at ranking reports for years. I’m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold “no”. I don’t give a fucking dead rat’s ass what ranking reports based on scraped SERPs could tell.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

The anatomy of a deceptive Tweet spamming Google Real-Time Search

Google real time search spammed and abusedMinutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I’d be the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

  • 1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
  • 1 landing page that makes the punter pull out the plastic in no time.
  • 1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
  • 1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a prize basic coding skills.

Preparation

Create a new text file and name it hot-topic.php or so. Then code:
<?php
$landingPageUri = "http://affiliate-program.com/?your-aff-id";
$trustedPageUri = "http://google.com/something.py";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
header("HTTP/1.1 307 Here you go today", TRUE, 307);
header("Location: $trustedPageUri");
}
else {
header("HTTP/1.1 301 Happy shopping", TRUE, 301);
header("Location: $landingPageUri");
}
exit;
?>

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

  • Login to Twitter and submit lots of nicely crafted, not too much keyword stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girl friend’s romantic moments’.
  • On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
  • Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
  • Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Hard facts about URI spam

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed
&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

GA KrakenI’ve bolded the canonical URI, everything after the questionmark is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source utm_medium utm_campaign utm_content utm_term) the amount of cluttered URIs pointing to the same piece of content can sum up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo didn’t index GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Lets forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless high sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, lets look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of troubles at the landing page.

Some scripts will process only variables from Google’s query string, because they extract GET input from the URI’s last questionmark to the fragment delimiter “#” or end of URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurish extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparision can fail and the trackback gets deleted on arrival, without notice. If I’d dig a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools feeded by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, respecively now acting counterproductive with regard to GA reporting. Google’s UTM clutter has impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds –I really mean that, literally!–, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion of webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.

Perhaps this procedure would be even more accurate, because link drops can no longer mimick feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:2010 SEMMY Nominee

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

As if sloppy social media users ain’t bad enough … search engines support traffic theft

Prepare for a dose of techy tin foil hattery. [Skip rant] Again, I’m going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it’s URI, not URL).

don't get seduced by URI shortenersRecap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential visitors customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of tinyurl.com’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There’re still way too much playgrounds for deceptive tactics involving shortened URIs

How should a search engine handle a shortened URI?

Handling an URI as shortened URL requires a bullet proof method to detect shortened URIs. That’s a breeze.

  • Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads on the top frame, and I’ve seen even a few of them making use of meta refreshs and client sided redirects. Search engines can detect all those procedures.
  • Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
  • Seed: trusted sources like LongURL.org provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like su.pr …).
  • Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
  • URI length: the length of an URI string, if less or equal 20 characters, is an indicator at most, because some URI shortening services offer keyword rich short URIs, and many sites provide natural URIs this short.

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless the method of redirection. All the link juice belongs to the destination page.

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Bing 301 messed up SERPsOh what a mess. The candidate from Redmond fails totally on understanding the HTTP protocol. Their search index is flooded with a bazillion of URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. Also, you’ll find URIs that do a permanent redirect and have nothing to do with URI shortening in their index, too.

I can’t be bothered with checking what Bing does in response to other redirects, since the 301 test fails so badly. Clicking on their first results for [site:tinyurl.com], I’ve noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI-SERPs, please learn more under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo 301 somewhat OkYahoo has done a better job. They index short URIs and show the destination page, at least via their site explorer. When I search for a tinyURL, the SERP link points to the URI shortener, that could get improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short-URIs totally right (I will not elaborate on this issue, so please don’t ask for detailled information if you’re not a SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvements

Google fails with meritsAs for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.

As for the somewhat evil tactis I’ve applied in my tests so far, Google fell in love with some abusive short-URIs. Google –under particular circumstances– indexes shortened URIs that game Googlebot, having sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I’ve begun to deploy sneaky sURLs, Google greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

Long time ago Tim Berners-Lee told you that URI shorteners are evil fucking with URIs is a very bad habit. Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped into a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

fight evil URI shortenersBesides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us getting rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive the livelihood of these utterly useless scumbags.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. — and that addresses all services that make use of or transport shortened URIs — put and end to shortened URIs. Now!



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

How to handle a machine-readable pandemic that search engines cannot control

R.I.P. rel-nofollowWhen you’re familiar with my various rants on the ever morphing rel-nofollow microformat infectious link disease, don’t read further. This post is not polemic, ironic, insulting, or otherwise meant to entertain you. I’m just raving about a way to delay the downfall of the InterWeb.

Lets recap: The World Wide Web is based on hyperlinks. Hyperlinks are supposed to lead humans to interesting stuff they want to consume. This simple and therefore brilliant concept worked great for years. The Internet grew up, bubbled a bit, but eventually it gained world domination. Internet traffic was counted, sold, bartered, purchased, and even exchanged for free in units called “hits”. (A “hit” means one human surfer landing on a sales pitch. That is a popup hell designed in a way that somebody involved just has to make a sale).

Then in the past century two smart guys discovered that links scraped from Web pages can be misused to provide humans with very accurate search results. They even created a new currency on the Web, and quickly assigned their price tags to Web pages. Naturally, folks began to trade green pixels instead of traffic. After a short while the Internet voluntarily transferred it’s world domination to the company founded by those two smart guys from Stanford.

Of course the huge amount of green pixel trades made the search results based on link popularity somewhat useless, because the webmasters gathering the most incoming links got the top 10 positions on the search result pages (SERPs). Search engines claimed that a few webmasters cheated on their way to the first SERPs, although lawyers say there’s no evidence of any illegal activities related to search engine optimization (SEO).

However, after suffering from heavy attacks from a whiny blogger, the Web’s dominating search engine got somewhat upset and required that all webmasters have to assign a machine-readable tag (link condom) to links sneakily inserted into their Web pages by other webmasters. “Sneakily inserted links” meant references to authors as well as links embedded in content supplied by users. All blogging platforms, CMS vendors and alike implemented the link condom, eliminating presumably 5.00% of the Web’s linkage at this time.

A couple of months later the world dominating search engine demanded that webmasters have to condomize their banner ads, intercompany linkage and other commercial links, as well as all hyperlinked references that do not count as pure academic citation (aka editorial links). The whole InterWeb complied, since this company controlled nearly all the free traffic available from Web search, as well as the Web’s purchasable traffic streams.

Roughly 3.00% of the Web’s links were condomized, as the search giant spotted that their users (searchers) missed out on lots and lots of valuable contents covered by link condoms. Ooops. Kinda dilemma. Taking back the link condom requirements was no option, because this would have flooded the search index with billions of unwanted links empowering commercial content to rank above boring academic stuff.

So the handling of link condoms in the search engine’s crawling engine as well as in it’s ranking algorithm was changed silently. Without telling anybody outside their campus, some condomized links gained power, whilst others were kept impotent. In fact they’ve developed a method to judge each and every link on the whole Web without a little help from their friends link condoms. In other words, the link condom became obsolete.

Of course that’s what they should have done in the first place, without asking the world’s webmasters for gazillions of free-of-charge man years producing shitloads of useless code bloat. Unfortunately, they didn’t have the balls to stand up and admit “sorry folks, we’ve failed miserably, link condoms are history”. Therefore the Web community still has to bother with an obsolete microformat. And if they –the link comdoms– are not dead, then they live today. In your markup. Hurting your rankings.

If you, dear reader, are a Googler, then please don’t feel too annoyed. You may have thought that you didn’t do evil, but the above said reflects what webmasters outside the ‘Plex got from your actions. Don’t ignore it, please think about it from our point of view. Thanks.

Still here and attentive? Great. Now lets talk about scenarios in WebDev where you still can’t avoid rel-nofollow. If there are any — We’ll see.

PageRank™ sculpting

Dude, PageRank™ sculpting with rel-nofollow doesn’t work for the average webmaster. It might even fail when applied as high sophisticated SEO tactic. So don’t even think about it. Simply remove the rel=nofollow from links to your TOS, imprint, and contact page. Cloak away your links to signup pages, login pages, shopping carts and stuff like that.

Link monkey business

I leave this paragraph empty, because when you know what you do, you don’t need advice.

Affiliate links

There’s no point in serving A elements to Googlebot at all. If you haven’t cloaked your aff links yet, go see a SEO doctor.

Advanced SEO purposes

See above.

So what’s left? User generated content. Lets concentrate our extremely superfluous condomizing efforts on the one and only occasion that might allow to apply rel-nofollow to a hyperlink on request of a major search engine, if there’s any good reason to paint shit brown at all.

Blogging

If you link out in a blog post, then you vouch for the link’s destination. In case you disagree with the link destination’s content, just put the link as

<strong class="blue_underlined" title="http://myworstenemy.org/" onclick="window.location=this.title;">My Worst Enemy</strong>

or so. The surfer can click the link and lands at the estimated URI, but search engines don’t pass reputation. Also, they don’t evaporate link juice, because they don’t interpret the markup as hyperlink.

Blog comments

My rule of thumb is: Moderate, DoFollow quality, DoDelete crap. Install a conditional do-follow plug-in, set everything on moderation, use captchas or something similar, then let the comment’s link juice flow. You can maintain a white list that allows instant appearance of comments from your buddies.

Forums, guestbooks and unmoderated stuff like that

Separate all Web site areas that handle user generated content. Serve “index,nofollow” meta tags or x-robots-headers for all those pages, and link them from a site map or so. If you gather index-worthy content from users, then feed crawlers the content in a parallel –crawlable– structure, without submit buttons, perhaps with links from trusted users, and redirect human visitors to the interactive pages. Vice versa redirect crawlers requesting live pages to the spider fodder. All those redirects go with a 301 HTTP response code.

If you lack the technical skills to accomplish that, then edit your /robots.txt file as follows:

User-agent: Googlebot
# Dear Googlebot, drop me a line when you can handle forum pages
# w/o rel-nofollow crap. Then I'll allow crawling.
# Treat that as conditional disallow:
Disallow: /forum

As soon as Google can handle your user generated content naturally, they might send you a message in their Webmaster console.

Anything else

Judge yourself. Most probably you’ll find a way to avoid rel-nofollow.

Conclusion

Absolutely nobody needs the rel-nofollow microformat. Not even search engines for the sake of their index. Hence webmasters as well as search engines can stop wasting resources. Farewell rel="nofollow", rest in peace. We won’t miss you.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

Vaporize yourself before Google burns your linking power

PIC-1: Google PageRank(tm) 2007I couldn’t care less about PageRank™ sculpting, because a well thought out link architecture does the job with all search engines, not just Google. That’s where Google is right on the money.

They own PageRank™, hence they can burn, evaporate, nillify, and even divide by zero or multiply by -1 as much PageRank™ as they like; of course as long as they rank my stuff nicely above my competitors.

Picture 1 shows Google’s PageRank™ factory as of 2007 or so. Actually, it’s a pretty simplified model, but since they’ve changed the PageRank™ algo anyway, you don’t need to bother with all the geeky details.

As a side note: you might ask why I don’t link to Matt Cutts and Danny Sullivan discussing the whole mess on their blogs? Well, probably Matt can’t afford my advertising rates, and the whole SEO industry has linked to Danny anyway. If you’re nosy, check out my source code to learn more about state of the art linkage very compliant to Google’s newest guidelines for advanced SEOs (summary: “Don’t trust underlined blue text on Web pages any longer!”).

PIC-2: Google PageRank(tm) 2009What really matters is picture 2, revealing Google’s new PageRank™ facilities, silently launched in 2008. Again, geeky details are of minor interest. If you really want to know everything, then search for [operation bendover] at !Yahoo (it’s still top secret, and therefore not searchable at Google).

Unfortunately, advanced SEO folks (whatever that means, I use this term just because it seems to be an essential property assigned to the participants of the current PageRank™ uprising discussion) always try to confuse you with overcomplicated graphics and formulas when it comes to PageRank™. Instead, I ask you to focus on the (important) hard core stuff. So go grab a magnifier, and work out the differences:

  • PageRank™ 2009 in comparision to PageRank™ 2007 comes with a pipeline supplying unlimited fuel. Also, it seems they’ve implemented the green new deal, switching from gas to natural gas. That means they can vaporize way more link juice than ever before.
  • PageRank™ 2009 produces more steam, and the clouds look slightly different. Whilst PageRank™ 2007 ignored nofollow crap as well as links put with client sided scripting, PageRank™ 2009 evaporates not only juice covered with link condoms, but also tons of other permutations of the standard A element.
  • To compensate the huge overall loss of PageRank™ caused by those changes, Google has decided to pass link juice from condomized links to their target URI hidden to Googlebot with JavaScript. Of course Google formerly has recommended the use of JavaScript-links to prevent the webmasters from penalties for so-called “questionable” outgoing links. Just as they’ve not only invented rel-nofollow, but heavily recommended the use of this microformat with all links disliked by Google, and now they take that back as if a gazillion links on the Web could magically change just because Google tweeks their algos. Doh! I really hope that the WebSpam-team checks the age of such links before they penalize everything implemented according to their guidelines before mid-2009 or the InterWeb’s downfall, whatever comes last.

I guess in the meantime you’ve figured out that I’m somewhat pissed. Not that the secretly changed flow of PageRank™ a year ago in 2008 had any impact on my rankings, or SERP traffic. I’ve always designed my stuff with PageRank™ flow in mind, but without any misuses of rel=”nofollow”, so I’m still fine with Google.

What I can’t stand is when a search engine tries to tell me how I’ve to link (out). Google engineers are really smart folks, they’re perfectly able to develop a PageRank™ algo that can decide how much Google-juice a particular link should pass. So dear Googlers, please –WRT to the implementation of hyperlinks– leave us webmasters alone, dump the rel-nofollow crap and rank our stuff in the best interest of your searchers. No longer bother us with linking guidelines that change yearly. It’s not our job nor responsibility to act as your cannon fodder slavish code monkeys when you spot a loophole in your ranking- or spam-detection-algos.

Of course the above said is based on common sense, so Google won’t listen (remember: I’m really upset, hence polemic statements are absolutely appropriate). To prevent webmasters from irrational actions by misleaded search engines, I hereby introduce the

Webmaster guidelines for search engine friendly links

What follows is pseudo-code, implement it with your preferred server sided scripting language.

if (getAttribute($link, 'rel') matches '*nofollow*' &&
    $userAgent matches '*Googlebot*') {
    print '<strong rev="' + getAttribute(link, 'href') + '"'
    + ' style="color:blue; text-decoration:underlined;"'
    + ' onmousedown="window.location=document.getElementById(this.id).rev; "'
    + '>' + getAnchorText($link) + '</strong>';
}
else {
    print $link;
}

Probably it’s a good idea to snip both the onmousedown trigger code as well as the rev attribute, when the script gets executed by Googlebot. Just because today Google states that they’re going to pass link juice to URIs grabbed from the onclick trigger, that doesn’t mean they’ll never look at the onmousedown event or misused (X)HTML attributes.

This way you can deliver Googlebot exactly the same stuff that the punter surfer gets. You’re perfectly compliant to Google’s cloaking restrictions. There’s no need to bother with complicated stuff like iFrames or even disabled blog comments, forums or guestbooks.

Just feed the crawlers with all the crap the search engines require, then concentrate all your efforts on your UI for human vistors. Web robots (bots, crawlers, spiders, …) don’t supply your signup-forms w/ credit card details. Humans do. If you find the time to upsell them while search engines keep you busy with thoughtless change requests all day long.



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

@ALL: Give Google your feedback on NOINDEX, but read this pamphlet beforehand!

Dear Google, please respect NOINDEXMatt Cutts asks us How should Google handle NOINDEX? That’s a tough question worth thinking twice before you submit a comment to Matt’s post. Here is Matt’s question, all the background information you need, and my opinion.

What is NOINDEX?

Noindex is an indexer directive defined in the Robots Exclusion Protocol (REP) from 1996 for use in robots meta tags. Putting a NOINDEX value in a page’s robots meta tag or X-Robots-Tag tells search engines that they shall not index the page content, but may follow links provided on the page.

To get a grip on NOINDEX’s role in the REP please read my Robots Exclusion Protocol summary at SEOmoz. Also, Google experiments with NOINDEX as crawler directive in robots.txt, more on that later.

How major search engines treat NOINDEX

Of course you could read a ton of my pamphlets to extract this information, but Matt’s summary is still accurate and easier to digest:

    [Matt Cutts on August 30, 2006]
  • Google doesn’t show the page in any way.
  • Ask doesn’t show the page in any way.
  • MSN shows a URL reference and cached link, but no snippet. Clicking the cached link doesn’t return anything.
  • Yahoo! shows a URL reference and cached link, but no snippet. Clicking on the cached link returns the cached page.

Personally, I’d prefer it if every search engine treated the noindex meta tag by not showing a page in the search results at all. [Meanwhile Matt might have a slightly different opinion.]

Google’s experimental support of NOINDEX as crawler directive in robots.txt also includes the DISALLOW functionality (an instruction that forbids crawling), and most probably URIs tagged with NOINDEX in robots.txt cannot accumulate PageRank. In my humble opinion the DISALLOW behavior of NOINDEX in robots.txt is completely wrong, and without any doubt in no way compliant to the Robots Exclusion Protocol.

Matt’s question: How should Google handle NOINDEX in the future?

To simplify Matt’s poll, lets assume he’s talking about NOINDEX as indexer directive, regardless where a Webmaster has put it (robots meta tag, X-Robots-Tag, or robots.txt).

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

Here are the arguments, or pros and cons, for each variant:

Google should completely drop a NOINDEX’ed page from their search results

Obviously that’s what most Webmasters would prefer:

This is the behavior that we’ve done for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way — in fact, one of the only ways — to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters.

NOINDEX means don’t index, search engines must respect such directives, even when the content isn’t password protected or cloaked away (redirected or hidden for crawlers but not for visitors).

The corner case that Google discovers a link and lists it on their SERPs before the page that carries a NOINDEX directive is crawled and deindexed isn’t crucial, and could be avoided by a (new) NOINDEX indexer directive in robots.txt, which is requested by search engines quite frequently. Ok, maybe Google’s BlitzCrawler™ has to request robots.txt more often then.

Google should show a reference to NOINDEX’ed pages on their SERPs

Search quality and user experience are strong arguments:

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like [3 linked examples] aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).

Search quality and searchers’ user experience is also a strong argument for totally delisting NOINDEX’ed pages, because most Webmasters use this indexer directive to keep stuff that doesn’t provide value for searchers out of the search indexes. <polemic>I mean, how much weight have a few Korean sites when it comes to decisions that affect the whole Web?</polemic>

If a Webmaster puts a NOINDEX directive by accident, that’s easy to spot in the site’s stats, considering the volume of traffic that Google controls. I highly doubt that a simple URI reference with an anchor text scrubbed from external links on Google SERPs would heal such a mistake. Also, Matt said that Google could add a NOINDEX check to the Webmaster Console.

The reference to the URI removal tools is out of context, because these tools remove an URI only for a short period of time and all removal requests have to be resubmitted repeatedly every few weeks. NOINDEX on the other hand is a way to keep an URI out of the index as long as this crawler directive is provided.

I’d say the sole argument for listing references to NOINDEX’ed pages that counts is misleading navigational searches. Of course that does not mean that Google may ignore the NOINDEX directive to show –with a linked reference– that they know a resource, despite the fact that the site owner has strictly forbidden such references on SERPs.

Something in between, Google should find a reasonable way to please both Webmasters and searchers

Quoting Matt again:

The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.

Whether or not a site is popular must not impact a search engine’s respect for a Webmaster’s decision to keep search engines, and their users, out of her realm. That reads like “Hey, Google is popular, so we’ve the right to go to Mountain View to pillage the Googleplex, acquiring everything we can steal for the public domain”. Neither Webmasters nor search engines should mimic Robin Hood. Also, lots of Webmasters highly doubt that Google’s idea of (link) popularity should rule the Web. ;)

Whether or not a site is listed in the ODP directory is definitely not an indicator that can be applied here. Last time I looked the majority of the Web’s content wasn’t listed at DMOZ due to the lack of editors and various other reasons, and that includes gazillions of great and useful resources. I’m not bashing DMOZ here, but as a matter of fact it’s not comprehensive enough to serve as indicator for anything, especially not importance and popularity.

I strongly believe that there’s no such thing as a criterion suitable to mark out a two class Web.

My take: Yes, No, Depends

Google could enhance navigational queries –and even “I feel lucky” queries– that lead to a NOINDEX’ed page with a message like “The best matching result for this query was blocked by the site”. I wouldn’t mind if they mention the URI as long as it’s not linked.

In fact, the problem is the granularity of the existing indexer directives. NOINDEX is neither meant for nor capable of serving that many purposes. It is wrong to assign DISALLOW semantics to NOINDEX, and it is wrong to create two classes of NOINDEX support. Fortunately, we’ve more REP indexer directives that could play a role in this discussion.

NOODP, NOYDIR, NOARCHIVE and/or NOSNIPPET in combination with NOINDEX on a site’s home page, that is either a domain or subdomain, could indicate that search engines must not show references to the URI in question. Otherwise, if no other indexer directives elaborate NOINDEX, search engines could show references to NOINDEX’ed main pages. The majority of navigational search queries should lead to main pages, so that would solve the search quality issues.

Of course that’s not precise enough due to the lack of a specific directive that deals with references to forbidden URIs, but it’s way better than ignoring NOINDEX in its current meaning.

A fair solution: NOREFERENCE

If I’d make the decision at Google and couldn’t live with a best matching search result blocked  message, I’d go for a new REP tag:

“NOINDEX, NOREFERENCE” in a robots meta tag –respectively Googlebot meta tag– or X-Robots-Tag forbids search engines to show a reference on their SERPs. In robots.txt this would look like
NOINDEX: /
NOINDEX: /blog/
NOINDEX: /members/

NOREFERENCE: /
NOREFERENCE: /blog/
NOREFERENCE: /members/

Search engines would crawl these URIs, and follow their links as long as there’s no NOFOLLOW directive either in robots.txt or a page specific instruction.

NOINDEX without a NOREFERENCE directive would instruct search engines not to index a page, but allows references on SERPs. Supporting this indexer directive both in robots.txt as well as on-the-page (respectively in the HTTP header for X-Robots-Tags) makes it easy to add NOREFERENCE on sites that hate search engine traffic. Also, a syntax variant like NOINDEX=NOREFERENCE for robots.txt could tell search eniges how they have to treat NOINDEX statements on site level, or even on site area level.

Even more appealing would be NOINDEX=REFERENCE, because only the very few Webmasters that would like to see their NOINDEX’ed URIs on Google’s SERPs would have to add a directive to their robots.txt at all. Unfortunately, that’s not doable for Google unless they can convice three well known Korean sites to edit their robots.txt. ;)

 

By the way, don’t miss out on my draft asking for REP tag support in robots.txt!

Anyway: Dear Google, please don’t touch NOINDEX! :)



Share/bookmark this: del.icio.usGooglema.gnoliaMixxNetscaperedditSphinnSquidooStumbleUponYahoo MyWeb
Subscribe to      Entries Entries      Comments Comments      All Comments All Comments
 

  1 | 2 | 3  Next Page »