How to spam the hell out of Google’s new source attribution meta elements

The moment you read Google’s announcement and Matt’s question “What about spam?”, you concluded “spamming it is a breeze”, right? You’re not alone.

Before we discuss how to abuse it, it might be a good idea to define it within its context, ok?

Playground

First of all, Google announced these meta tags on the official Google News blog for a reason. So if you plan to abuse them with your countless MFA proxies of Yahoo Answers, you’ve most probably jumped on the wrong bandwagon. Google supports the meta elements below in Google News only.

syndication-source

The first new indexer hint is syndication-source. It’s meant to tell Google the permalink of a particular news story, hence the author and all the folks spreading the word are asked to use it to point to the one –and only one– URI considered the source:

<meta name="syndication-source" content="http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html" />

The meta element above is for instances of the story served from
http://outerspace.com/breaking/page1.html
http://outerspace.com/yyyy-mm-dd/page2.html
http://outerspace.com/news/aliens-appreciate-google-hotpot.html
http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html
http://newspaper.com/main/breaking.html
http://tabloid.tv/rehashed/from/rss/hot:alien-pot-in-your-bong.html

Don’t confuse it with the cross-domain rel-canonical link element. It’s not about canning duplicate content; it marks a particular story, regardless of whether it’s somewhat rewritten or just reprinted with a different headline. It tells Google News to use the original URI when the story can be crawled from different URIs on the author’s server, and when syndicated stories on other servers are so similar to the initial piece that Google News prefers to use the original (the latter is my educated guess).

original-source

The second new indexer hint is original-source. It’s meant to tell Google the origin of the news itself, so the author/enterprise digging it out of the mud, as well as all the folks using it later on, are asked to declare who broke the story:

<meta name="original-source" content="http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html" />

Say we’ve got two or more related news stories, like “Google fell from Mars” by cnn.com and “Google landed in Mountain View” by sfgate.com; it makes sense for latimes.com to publish a piece like “Google fell from Mars and landed in Mountain View”. Because latimes.com is a serious newspaper, they credit their sources not only with a mention or embedded links, they do it machine-readable, too:

<meta name="original-source" content="http://cnn.com/google-fell-from-mars.html" />
<meta name="original-source" content="http://sfgate.com/google-landed-in-mountain-view.html" />

It’s a matter of course that both cnn.com and sfgate.com provide such an original-source meta element on their pages, in addition to the syndication-source meta element, both pointing to their very own coverage.
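
For instance, cnn.com’s own story page could emit both elements pointing at its very own coverage. A sketch in PHP (the URI is the one from the example above, the template code is just an illustration):

<?php
// Hypothetical template snippet: the source of a story references itself
// with both meta elements (URI taken from the example above).
$storyUri = "http://cnn.com/google-fell-from-mars.html";
echo '<meta name="syndication-source" content="' . $storyUri . '" />' . "\n";
echo '<meta name="original-source" content="' . $storyUri . '" />' . "\n";
?>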

If a journalist grabbed his breaking news from a secondary source telling him “CNN reported five minutes ago that Google’s mothership started from Venus, and the LA Times spotted it crashing on Jupiter”, he can’t be bothered with looking at the markup and locating those meta elements in the head section; he has a deadline for his piece “Why Web search left Planet Earth”. It’s just fine with Google News when he puts

<meta name="original-source" content="http://cnn.com/" />
<meta name="original-source" content="http://sfgate.com/" />

Fine print

As always, the most interesting stuff is hidden on a help page:

At this time, Google News will not make any changes to article ranking based on these tags.

If we detect that a site is using these metatags inaccurately (e.g., only to promote their own content), we’ll reduce the importance we assign to their metatags. And, as always, we reserve the right to remove a site from Google News if, for example, we determine it to be spammy.

As with any other publisher-supplied metadata, we will be taking steps to ensure the integrity and reliability of this information.

It’s a field test

We think it is a promising method for detecting originality among a diverse set of news articles, but we won’t know for sure until we’ve seen a lot of data. By releasing this tag, we’re asking publishers to participate in an experiment that we hope will improve Google News and, ultimately, online journalism. […] Eventually, if we believe they prove useful, these tags will be incorporated among the many other signals that go into ranking and grouping articles in Google News. For now, syndication-source will only be used to distinguish among groups of duplicate identical articles, while original-source is only being studied and will not factor into ranking. [emphasis mine]

Spam potential

Well, we do know that Google Web search has a spam problem; IOW, even a few so-1999 webspam tactics still work to some extent. So we tend to classify a vague threat like “If we find sites abusing these tags, we may […] remove [those] from Google News entirely” as FUD, and spam away. Common sense and experience tell us that a smart marketer will make money from everything spammable.

But: we’re not talking about Web search. Google News is a clearly laid out environment. There are only so many sites covered by Google News. Even if Google weren’t able to develop algos analyzing all source attribution attributes out there, they do have the resources to identify abuse using manpower alone. Most probably they will do both.

They clearly told us that they will compare this meta data to other signals. And those aren’t only very weak indicators like “timestamp first crawled” or “first heard of via PubSubHubbub”. It’s not that hard to isolate a particular news story, gather each occurrence as well as the source mentions within, and arrange those on a timeline with clickable links for QC folks who will most certainly identify the actual source. Even a few spot tests daily will soon reveal the sites whose source attribution meta tags are questionable, or even spammy.

If you’re still not convinced, fair enough. Go spam away. Once you’ve lost your entry on the whitelist, your free traffic from Google News, as well as from news-one-box results on conventional SERPs, is toast.

Last but not least, a fair warning

Now, if you still want to use source attribution meta elements on your non-newsworthy MFA sites to claim ownership of your scraped content, feel free to do so. Most probably Matt’s team will appreciate just another “I’m spamming Google” signal.

Not that reprinting scraped content is considered shady any more: even a former president does it shamelessly. It’s just the almighty Google in all of its evilness that penalizes you for considering all on-line content public domain.




While doing evil, reluctantly: Size, er trust matters.

These Interwebs are a mess. One can’t trust anyone. Especially not link drops, since Twitter decided to break the Web by raping all of its URIs. Twitter’s sloppy URI gangbang became the Web’s biggest and most disgusting clusterfuck in no time.

I still can’t agree to the friggin’ “N” in SNAFU when it comes to URI shortening. Every time I’m doing evil myself at sites like bit.ly, I’m literally vomiting all over the ‘net — in Swahili, er base36 pidgin.

Besides the fact that each and every shortened URI manifests a felonious design flaw, the major concern is that most –if not all– URI shorteners will die before the last URI they’ve shortened is irrevocably dead. And yes, shit happens all day long — RIP tr.im et al.

Letting shit happen is by no means a dogma. We shouldn’t throw away common sense and best practices when it comes to URI management, which, besides avoiding as many redirects as possible, includes risk management:

What if the great chief of Libya all of a sudden decides that gazillions of bit.ly URIs redirecting punters to their desired smut aren’t exactly compatible with the Qur’an? All your bit.ly URIs will be defunct overnight, and because you rely on traffic from places you’ve spammed with your shortened URIs, you’ll be forced to downgrade your expensive hosting plan to a shitty freehost account that displays huge Al-Qaeda or even Weight-Watchers banners above the fold of your pathetic Web pages.

In related news, even the almighty Google just pestered the Interwebs with yet another URI shortener: goo.gl. It promises stability, security, and speed.

Well, on the day it launched I broke it with recursive chains of redirects, and meanwhile creative folks like Dave Naylor have perhaps written a guide on “hacking goo.gl for fun and profit”. #abuse

Of course there are bugs in a brand new product. But Google is a company iterating code way faster than most Internet companies, and due to their huge user base and continuous testing under operating conditions they’re aware of most of their bugs. They’ll fix them eventually, and soon goo.gl –as promised– will be “the stablest, most secure, and fastest URL shortener on the Web”.

So, just based on the size of Google’s infrastructure, it seems goo.gl is going to be the most reliable one out of all evil URI shorteners. Kinda queen of all royal PITAs. But is this a good enough reason to actually use goo.gl? Not quite enough, yet.

Go ask a Googler “Can you guarantee that goo.gl will outlive the Internet?”. I got answers like “I agree with your concern. I thought about it myself. But I’m confident Google will try its very best to preserve that”. From an engineer’s perspective, all of them agree with my statement “URI shortening totally sucks ass”. But IRL the Interwebs are flooded with crappy shortURLs, and that’s not acceptable. They figured that URI shortening can’t be eliminated, so it had to be enhanced by a more reliable procedure. Hence bright folks like Muthu Muthusrinivasan, Devin Mullins, Ben D’Angelo et al created goo.gl, with mixed feelings.

That’s why I recommend the lesser evil. Not because Google is huge, has the better infrastructure, picked a better domain, and the whole shebang. I do trust these software engineers, because they think and act like me. Plus, they’ve got the resources.

I’m going goo.gl.
I’ll dump bit.ly etc.

Fine print: However, I won’t throw away my very own URI shortener, because this evil piece of crap can do things the mainstream URI shorteners –including goo.gl– are still dreaming of, like preventing search engine crawlers from spotting affiliate links and such stuff. Shortening links alone doesn’t equal cloaking fishy links professionally.
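
My own script isn’t published here, but the plain-vanilla part of such a self-hosted shortener is trivial. A rough sketch (hypothetical go.php endpoint, hard-coded map instead of a database), with an X-Robots-Tag header so the redirect URIs at least stay out of search indexes:

<?php
// Minimal sketch of a self-hosted shortener endpoint (hypothetical go.php).
// Map short codes to destination URIs -- in real life a database table.
$map = array(
    'a1' => 'http://example.com/affiliate-offer?aff=4711',
    'b2' => 'http://example.com/another-landing-page',
);
$code = isset($_GET['c']) ? $_GET['c'] : '';
if (isset($map[$code])) {
    // Keep the redirect URIs themselves out of search indexes.
    header('X-Robots-Tag: noindex, nofollow');
    header('Location: ' . $map[$code], true, 301);
} else {
    header('HTTP/1.1 404 Not Found');
}
exit;

A robots.txt Disallow for the endpoint on top of that wouldn’t hurt either.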




Is Google a search engine based in Mountain View, CA (California, USA)?

Is Google a search engine? Honest answer: Dunno. Google might be a search engine. It could be a fridge, too. Or a yet undiscovered dinosaur, a scary purple man-eater, a prescription drug, or my mom’s worst nightmare.

According to the search engine pre-installed in my browser, “Dogpile” is a search engine, and “Bing”, “Altavista”, even “Wikipedia”. Also a tool called “Google Custom Search” and a popular blog titled “Search Engine Land” are considered search engines, besides obscure Web sites like “Ask”, “DuckDuckGo”, “MetaCrawler” and “Yahoo”. Sorry, I can’t work with these suggestions.

So probably I need to perform a localized search to get an answer:

Is Google a search engine based in Mountain View, CA?

0.19 seconds later my browser’s search facility delivers the desired answer, instantly, at near lightning speed. The first result for [Is Google a search engine based in Mountain View, CA] lists an entity outing itself as “Google Mountain View”, the second result is “Googleplex”.

Wait … that doesn’t really answer my question. First, the search result page says “near Mountain View”, but I’ve asked for a search engine “in Mountain View”. Second, it doesn’t tell me whether Google, or Googleplex for that matter, is a search engine or a Swiss army knife. Third, a suitable answer would be either “yes” or “no”, but certainly not “maybe something that matches a term or two found in your search query could be relevant, hence I throw enough gibberish –like 65635 bytes of bloated HTML/JS code and a map– your way to keep you quiet for a while”.

I’m depressed.

But I don’t give up that easily. The office next door belongs to a detective agency. The detective in charge is willing to provide a little neighborly help, so I send him over to Mountain View to investigate that dubious “Googleplex”. The guy appears to be smart, so maybe he can reveal whether this location hosts a search engine or not.

Indeed, he’s kinda genius. He managed to interview a GoogleGuy working in building 43, who tells him that #1 rankings for [search engine] can’t be guaranteed, but #1 rankings for long tail phrases like [Google is a search engine based in Mountain View, California, USA] can be achieved by nearly everyone. My private eye taped the conversation with a hidden camera and submitted it to America’s Funniest Home Videos:

One question remains: Why can’t a guy that knowledgeable make it happen that his employer appears as the first search result for, well, [search engine], or at least [search engine based in Mountain View, California]? Go figure …

Sorry Matt, couldn’t resist. ;-)

Sebastian

spying at:

1600 Amphitheatre Parkway
Mountain View, CA 94043
USA




!knihT

Mantra: There’s no such thing as wisdom of the crowd. Repeat. There’s no such thing as wisdom of the crowd! You’ve got a brain of your own for a reason.

There’s a huge difference between Thomas J. Watson’s campaign in the 1920s, which made IBM –as a company gathering intelligent individuals– think big and therefore get big at the end of the day, and the votable daily insanity pestering social media, forums, blogs, and whatnot, that we willingly and thoughtlessly consume in today’s information ghetto. The difference is that nowadays the crowd delegates its thinking to a few well paid ‘early adopters’, bullshitters/prophets, and other conmen who dominate the Interwebs just because they’re loud enough.

In fact, all the hypes celebrated by the dumb crowds distract and mislead you on a daily basis. As a webmaster you really shouldn’t care about ‘latest discoveries’ like LDA and ADL, or search engine FUD reiterated on webmaster hangouts as advice that ‘answers any question’, for that matter.

Not that you can’t get valuable advice out of search engine webmaster guidelines at all. The opposite is true, but you need to read the source, and judge yourself based on your skills and your experience, applying common sense.

Also, there’s other good webmastering advice out there, if you’re willing to seek(needle, haystack=wget('http://google.com/search?q=seo|sem|webdev|webdesign|webmastering|internet-marketing&num=n')). Don’t. Rely on yourself, and your capability to interpret facts, not on speculation spread by ‘authoritative’ sources.

It’s so much easier to join a huge community or two, and to believe/implement/adapt whatever’s ‘hot’, or repeated often, respectively. Actually, that’s a crappy approach, because the very few small communities that openly discuss things that matter are out of reach for the average webmaster, chatting and networking protected by /var/inner-circle/private/.htpasswd.

Here are the components of a public webmaster/SEO/IM community, listed by revenue in ascending order (that’s -1 before zero and 1), which equals alleged trustworthiness/importance in descending order:

  • Many fanboys (m) and groupies (f) who don’t have a clue, but vote up everything that an entity listed below suggests. They will even rave, er, speak out at other, alien places if their idols (see below) get outed for bullshitting anywhere. They go by the title of junior members.
  • A few semi-professional whores who operate blogs/forums/aff-programs themselves, and manage to steal a tiny portion of the floating popularity to feed their pathetic outlets. Those are considered senior members.
  • A handful of shiny rockstars who silently suck up to their owner, er, master (see below). They may or may not participate monetarily, and have the power of moderators.
  • One single guy who laughs all the way to the bank.

Looked at in full daylight: when you join a crowd you become cannon fodder, and your financial misery is considered collateral damage. Lurking (silently listening to crowds) is not exactly cheaper, and certainly doesn’t make you an unsung hero, because you’ll totally share the crowd’s misery. Your balance sheet doesn’t lie, usually.

Reboot your brain before you jump on popular bandwagons. Don’t listen to advice that’s freely available, not even mine (WTF, you know what I mean). If somebody discusses ethics (hat colors), then run for your life, because ethics will kill your revenue. When it comes to SEO, it helps to evaluate (search engine/any) advice under the premise “what would I do, and what could I achieve (technically), if I ran this SE?”.

It’s all about you. Don’t care about the well-being of search engines that suffer from WebSpam, or the healthiness of affiliate programs that make shitloads of green out of it, but tell you ‘thou shalt not spam’ because they sneakily dominate your SERPs with their own graffiti. WebSpam is what gets you banned, everything else just makes you money. Test for yourself, and don’t take advice without proof that you can easily replicate on your very own servers.

Do not risk your earnings –that is, your existence!– with strategies and tactics you can’t handle over the long haul, just because some selfish moron tells you so.




WTF have Google, Bing, and Yahoo cooking?

Folks, I’ve got good news. As a matter of fact, it’s so good that it will revolutionize SEO. A little bird told me that the major search engines secretly teamed up to solve the problem of context and meaning as a ranking factor.

They’ve invented a new Web standard that allows content producers to steer search engine ranking algos. Its code name is ADL, probably standing for Aided Derivative Latch, a smart technology based on the groundwork of addressing tidbits of information developed by Hollerith and Neumann decades ago.

According to my sources, ADL will be launched next month at SMX East in New York City. In order to get you guys primed in a timely manner, here I’m going to leak the specs:

WTF - The official SEO standard, supported by Google, Yahoo & Bing

Word Targeting Funnel (WTF) is a set of indexer directives that get applied to Web resources as meta data. WTF comes with a few subsets for special use cases, details below. Here’s an example:

<meta name="WTF" content="document context" href="http://google.com/search?q=WTF" />

This directive tells search engines that the content of the page is closely related to the resource supplied in the META element’s HREF attribute.

As you’ve certainly noticed, you can target a specific SERP, too. That’s somewhat complicated, because the engineers couldn’t agree which search engine should define a document’s search query context. Fortunately, they finally found this compromise:

<meta name="WTF" content="document context" href="http://google.com/search?q=WTF || http://www.bing.com/search?q=WTF || http://search.yahoo.com/search?q=WTF" />

As far as I know, this will even work if you change the order of URIs. That is, if you’re a Bing fanboy, you can mention Bing before Google and Yahoo.

A more practical example, taken from an affiliate’s Viagra sales pitch that participated in the beta test, leads us to the first subset:

Subset WTFm — Word Targeting Funnel for medical terms

<meta name="WTF" content="document context" href="http://www.pfizer.com/files/products/uspi_viagra.pdf" />

This directive will convince search engines that the offered product indeed is not a clone like Cialis.

Subset WTFa — Word Targeting Funnel for acronyms

<meta name="WTFa" content="WTF" href="http://www.wtf.org/" />

When a Web resource contains the acronym “WTF”, search engines will link it to the World Taekwondo Federation, not to Your Ranting and Debating Resource at www.wtf.com.

Subset WTFo — Word Targeting Funnel for offensive language

<meta name="WTFo" content="meaning of terms" href="http://www.noslang.com/" />

If a search engine doesn’t know the meaning of terms I really can’t quote here, it will look them up in the Internet Slang Directory. You can define alternatives, though:

<meta name="WTFo" content="alternate meaning of terms" href="http://dictionary.babylon.com/language/slang/low-life-glossary/" />

WTF, even more?

Of course we’ve got more subsets, like WTFi for instant searches. Because I appreciate unfair advantages, I won’t reveal more. Just one more goody: it works for PDF, Flash content and heavily ajax’ed stuff, too.

This is the very first newish indexer directive that search engines introduce with support for both META elements and HTTP headers. Like with the X-Robots-Tag, you can use an X-WTF-Tag HTTP header:
X-WTF-Tag: Name: WTFb, Content: SEO Bullshit, Href: http://seobullshit.com/
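
Sticking with the spoof: that’s just another HTTP response header, so plain old PHP header() would emit it, e.g.:

<?php
// Part of the joke, obviously -- but the mechanism is real: any response
// header can be added with header() before output starts.
header('X-WTF-Tag: Name: WTFb, Content: SEO Bullshit, Href: http://seobullshit.com/');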

 

 

As for the little bird, well, that’s a lie. Sorry. There’s no such bird. It’s the bugs I left the last time I visited Google’s labs:
<meta name="WTF" content="bug,bugs,bird,birds" href="http://www.spylife.com/keysnoop.html" />




OMFG - Google sends porn punters to my website …

In today’s GWC doctor’s office, the webmaster of an innocent orphanage website asks Google’s Matt Cutts:

[My site] is showing up for searches on ‘girls in bathrooms’ because they have an article about renovating the girls bathroom! What do you think of the idea if a negative keyword meta tag to block irrelevant searches? [sic!]

Well, we don’t know what the friendly guy from Google recommends …

… but my dear readers do know that my bullshit detector, faced with such a moronic idea, shouts out in agony:

There’s no such thing as bad traffic, just weak monetizing!

Ok, Ok, Ok … every now and then each and every webmaster out there suffers from misguided search engine ranking algos that send shitloads of totally unrelated search traffic. For example, when you search for [how to fuck a click], you won’t expect that Google considers this geeky pamphlet the very best search result. Of course Google should’ve detected your NSFW typo. Shit happens. Deal with it.

On the other hand, search traffic is free, so there’s no valid reason to complain. Instead of asking Google for a minus-keyword REP directive, one should think of clever ways to monetize unrelated traffic without wasting bandwidth.

You want to monetize irrelevant traffic from searches for smut in a way that nobody can associate your site with porn. That’s doable. Here’s how it works:

Make risk-free beer money from porn traffic with a non-adult site

Copy those slimy phrases from your keyword stats and paste them into Google’s search box. Once you find an adult site that seems to match the smut surfer’s needs better than your site, click on the search result, and on the landing page search for a “webmasters” link that points to their affiliate program. Sign up and save your customized affiliate link.

Next add some PHP code to your scripts. Make absolutely sure it gets executed before you output any other content, even whitespace:

<?php
// getOffsiteUri() returns an affiliate URI that matches the referring search
// query better than this page does, or false (see the sketch below).
$betterMatch = getOffsiteUri();
if ($betterMatch) {
    header("HTTP/1.1 307 Here's your smut", TRUE, 307);
    header("Location: $betterMatch");
    exit;
}
?>
Refine the simplified code above. Use a database table to store the mappings …
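
getOffsiteUri() isn’t spelled out above; a minimal sketch, assuming a hard-coded keyword-to-affiliate-URI map in place of the database table:

<?php
// Hypothetical sketch of getOffsiteUri(): inspect the referring SERP's query
// string and return a better matching affiliate URI, or false.
function getOffsiteUri() {
    $map = array(
        'nude' => 'http://someteenpornsite.com/landingpage?affID=4711',
        // add more keyword => affiliate URI pairs, or pull them from a table
    );
    $referrer = isset($_SERVER['HTTP_REFERER']) ? strtolower($_SERVER['HTTP_REFERER']) : '';
    if ($referrer === '' || strpos($referrer, 'google.') === false) {
        return false; // only reroute visitors coming from SERPs
    }
    foreach ($map as $keyword => $uri) {
        if (strpos($referrer, $keyword) !== false) {
            return $uri;
        }
    }
    return false;
}

The hard-coded array is the part you’d replace with the database table.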

Now a surfer coming from a SERP like
http://google.com/search?num=100&q=nude+teens+in+bathroom&safe=off

will get redirected to
http://someteenpornsite.com/landingpage?affID=4711

You’re using a 307 redirect because it’s not cached by a user agent, so when you later find a porn site that converts your traffic better, you can redirect visitors to another URI.

As you probably know, search engines don’t approve of duplicate content. Hence it wouldn’t be a bright idea to put x-rated stuff (all smut is duplicate content by design) onto your site to fulfill the misled searcher’s needs.

Of course you can use the technique outlined above to protect searchers from landing on your contact/privacy page, too, when in fact your signup page is their desired destination.

Shiny whitehat disclaimer

If you’re afraid of the possibility that the almighty Google might punish you for your well-meant attempt to fix its bugs, relax.

A search engine that misinterprets your content so badly has failed miserably. Your bugfix actually improves their search quality. Search engines can’t force you to report such flaws; they just kindly ask for voluntary feedback.

If search engines dislike smart websites that find related content on the Interwebs in case the search engine delivers shitty search results, they can act themselves. Instead of penalizing webmasters that react to flaws in their algos, they’re well advised to adjust their scoring. I mean, if they stop sending smut traffic to non-porn sites, their users don’t get redirected any longer. It’s that simple.




Cloaking is good for you. Just ignore Bing’s/Google’s guidelines.

Summary first: If you feel the need to cloak, just do it within reason. Don’t cloak because you can, but because it’s technically the most elegant procedure to accomplish a Web development task. Bing and Google can’t detect your (in no way deceptive) intent algorithmically. Don’t spam away, though, because if you aren’t good enough at spamming search engines you might leave trails besides the cloaking alone. Keep your users’ interests in mind. Don’t comply with search engine guidelines as if they were set in stone, but to a reasonable level, for example when they force you to comply with Web standards that make more sense than the fancy idea you’ve developed for internationalization, based on detecting browser language settings or so.

This pamphlet is an opinion piece. The above said should be considered best practice, even by search engines. Of course it’s not, because search engines can and do fail, just like a webmaster who takes my statement “go cloak away if it makes sense” as technical advice and gets his search engine visibility tanked the hard way.

WTF is cloaking?

Cloaking, also known as IP delivery, means delivering content tailored for specific users who are identified primarily by their IP addresses, but also by user agent (browser, crawler, screen reader…) names, and whatnot. Here’s a simple demonstration of this technique. The content of the next paragraph differs depending on the user requesting this page. Googlebot, Googlers, as well as Matt Cutts at work, will read a personalized message:

Dear visitor, thanks for your visit from 54.204.107.48 (ec2-54-204-107-48.compute-1.amazonaws.com).
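
The demo above boils down to a few lines of PHP. A simplified sketch (not the exact code running on this page; the host name checks are assumptions):

<?php
// Simplified sketch of the personalized paragraph above.
$ip   = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip);            // reverse DNS lookup
$ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'Googlebot') !== false || substr($host, -strlen('.google.com')) === '.google.com') {
    echo "<p>Hi Googlers, hope you enjoy your stay.</p>";
} else {
    echo "<p>Dear visitor, thanks for your visit from " .
         htmlspecialchars($ip) . " (" . htmlspecialchars($host) . ").</p>";
}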

You surely can imagine that cloaking opens a can of worms, er, lots of opportunities to enhance a user’s surfing experience, besides “stalking” particular users like Google’s head of WebSpam.

Why do search engines dislike cloaking?

Apparently they don’t. They use IP delivery themselves. When you’re traveling in Europe, you’ll get hints like “go to Google.fr” or “go to Google.at” all the time. That’s google.com checking where you are, trying to lure you into their regional services.

More seriously, there’s a so-called “dark side of cloaking”. Say you’re a seasoned Internet marketer, then you could show Googlebot an educational page with compelling content under an URI like “/games/poker” with an X-Robots-Tag HTTP header telling “noarchive”, whilst surfers (search engine users) supplying an HTTP_REFERER and not coming from employee.google.com get redirected to poker dot com (simplified example).

That’s hard to detect for Google’s WebSpam team. Because they don’t do evil themselves, they can’t officially operate sneaky bots that use for example AOL as their ISP to compare your spider fodder to pages/redirects served to actual users.

Bing sends out spam bots that request your pages “as a surfer” in order to discover deceptive cloaking. Of course those bots can be identified, so professional spammers serve them their spider fodder. Besides burning the bandwidth of non-cloaking sites, Bing doesn’t accomplish anything useful in terms of search quality.

Because search engines can’t detect cloaking properly, not to speak of a cloaking webmaster’s intentions, they’ve launched webmaster guidelines (FUD) that forbid cloaking altogether. All Google/Bing reps tell you that cloaking is an evil black hat tactic that will get your site penalized or even banned. By the way, the same goes for perfectly legit “hidden content” that’s invisible on page load, but viewable after a mouse click on a “learn more” widget/link or so.

Bullshit.

If your competitor makes creative use of IP delivery to enhance their visitors’ surfing experience, you can file a spam report for cloaking and Google/Bing will ban the site eventually. Just because cloaking can be used with deceptive intent. And yes, it works this way. See below.

Actually, those spam reports trigger a review by a human, so maybe your competitor gets away with it. But search engines also use spam reports to develop spam filters that penalize crawled pages in a totally automated fashion. Such filters can fail, and –trust me– they do fail often. Once you must optimize your content delivery for particular users or user groups yourself, such a filter could tank your very own stuff by accident. So don’t snitch on your competitors, because tomorrow they’ll return the favor.

Enforcing a “do not cloak” policy is evil

At least Google’s WebSpam team comes with cojones. They’ve even banned their very own help pages for “cloaking“, although those didn’t serve porn to minors searching for SpongeBob images with safe-search=on.

That’s over the top, because the help files of any Google product aren’t usable without a search facility. When I click “help” in any Google service like AdWords, I get blank pages, and/or links within the help system are broken because the destination pages were deindexed for cloaking. Plain evil, and counterproductive.

Just because Google’s help software doesn’t show ads and related links to Googlebot, those pages aren’t guilty of deceptive cloaking. Ms Googlebot won’t pull the plastic, so it makes no sense to serve her advertisements. Related links are context sensitive just like ads, so it makes no sense to persist them in Google’s crawling cache, or even in Google’s search index. Also, as a user I really don’t care whether Google has crawled the same heading I see on a help page or not, as long as I get directed to relevant content, that is a paragraph or more that answers my question.

When a search engine intentionally doesn’t deliver the very best search results, just because those pages violate an outdated and utterly useless policy that targets fraudulent tactics in a shape last used in the last century and doesn’t take into account how the Internet works today, I’m pissed.

Maybe that’s not bad at all when applied to Google products? Bullshit, again. The same happens to any other website that doesn’t fit Google’s weird idea of “serving the same content to users and crawlers”. I mean, as long as Google’s crawlers come from US IPs only, how can a US-based webmaster serve the same content in German to a user coming from Austria and to Googlebot, both requesting a URI like “/shipping-costs?lang=de” that has to be different for each user, because shipping a parcel to Germany costs $30.00 and a parcel of the same weight shipped to Vienna costs $40.00? Don’t tell me bothering a user with shipping fees for all regions in CH/AT/DE on one page is a good idea, when I can reduce the information overflow to the one shipping fee my user expects to see, followed by a link to a page that lists shipping costs for all European countries, or all countries where at least some folks might speak/understand German.
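
To illustrate the shipping-costs example: a sketch of such a tailored page, assuming something like Apache’s mod_geoip populating GEOIP_COUNTRY_CODE (the fees are the example figures from above, the URIs are made up):

<?php
// Sketch of the /shipping-costs?lang=de idea: show only the fee this visitor
// cares about, plus a link to the full list for everyone else.
// Assumes a GeoIP module populating GEOIP_COUNTRY_CODE.
$fees = array('DE' => 30.00, 'AT' => 40.00);
$country = isset($_SERVER['GEOIP_COUNTRY_CODE']) ? $_SERVER['GEOIP_COUNTRY_CODE'] : '';

if (isset($fees[$country])) {
    // e.g. a visitor from Vienna sees just "$40.00"
    printf('<p>Versandkosten: $%.2f</p>', $fees[$country]);
} else {
    echo '<p><a href="/shipping-costs/all?lang=de">Versandkosten für alle Länder</a></p>';
}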

Back to Google’s ban of its very own help pages that hid AdSense code from Googlebot. Of course Google wants to see what surfers see in order to deliver relevant search results, and that might include advertisements. However, surrounding ads don’t necessarily obfuscate the page’s content. Ads served instead of content do. So if Google wants to detect ad-laden thin pages, they need to become smarter. Penalizing pages that don’t show ads to search engine crawlers is a bad idea for a search engine, because not showing ads to crawlers is a good idea, not only bandwidth-wise, for a webmaster.

Managing this dichotomy is the search engine’s job. They shouldn’t expect webmasters to help them solve their very own problems (maintaining search quality). In fact, bothering webmasters with policies put in place solely because search engine algos are fallible and incapable is plain evil. The same applies to instruments like rel-nofollow (launched to help Google devalue spammy links, but backfiring enormously) or Google’s war on paid links (as if not each and every link on the whole Internet is paid/bartered for, somehow).

What do you think, should search engines ditch their way too restrictive “don’t cloak” policies? Click to vote: Stop search engines that tyrannize webmasters!

 

Update 2010-07-06: Don’t miss out on Danny Sullivan’s “Google be fair!” appeal, posted today: Why Google Should Ban Its Own Help Pages — But Also Shouldn’t




Ditch the spam on SERPs, pretty please?

Say there’s a search engine that tries very hard to serve relevant results for long tail search queries. Maybe it even accepted that an algo change –supposed to wipe out shitloads of thin pages from its long tail search result pages (SERPs)– is referred to as #MayDay. One should think that this search engine isn’t exactly eager to annoy its users with crappy mash-up pages consisting of shabby stuff scraped from all known sources of duplicate content on the whole InterWebs.

Wrong.

Prominent SE spammers like Mahalo still flood the visible part of search indexes with boatloads of crap that should never be able to cheat its way onto any SERP, not even via a [site:spam.com] search. Learn more from Aaron and Michael, who’ve both invested their valuable time to craft detailed spam reports, to no avail.

Frustrating.

Wait. Why does a bunch of spammy Web pages create such a fuss? Because they’re findable in the search index. Of course a search engine must crawl all the WebSpam out there, and its indexer has to judge the value of all the content it gets fed. But there’s absolutely no need to bother the query engine, which gathers and ranks the stuff presented on the SERPs, with crap like that.

Dear Google, why do you annoy your users with spam created by “a scheme that your automated system handles quite well” at all? Those awesome spam filters should just flag crappy pages as not-SERP-worthy, so that they can never see the light of day at google.com/search. I mean, why should any searcher be at risk of pulling useless search results from your index? Hopefully not because these misled searchers tend to click on lots of Google ads on said pages, right?

I’d rather enjoy an empty SERP for an exotic search query, than suffer from a single link to a useless page plastered with huge ads, even if it comes with a tiny portion of stolen content that might be helpful if pointing to the source.

Do you feel like me? Speak out!

Hey Google, I dislike spam on your SERPs! #spam-report Tweet Your Plea For Clean SERPs!




Google went belly-up: SERPs sneakily redirect to FPAs

I’m pissed. I do know I shouldn’t blog in rage, but Google redirecting search engine result pages to totally useless Internet Explorer ads just fires up my ranting machine.

What does the almighty Google say about URIs that should deliver useful content to searchers, but sneakily redirect to full page ads? Here you go. Google’s webmaster guidelines explicitly forbid such black hat tactics:

“Don’t use cloaking or sneaky redirects.” Google just did the latter with its very own SERPs. The search interface google.com/ie, out in the wild for nearly a decade, redirects to a piece of sidebar HTML offering a download of IE8 optimized for Google. That’s a helpful redirect for some IE6 users who don’t suffer from an IT department stuck with this outdated browser, but it’s plain misleading in the eyes of all those searchers who appreciated this clean and totally uncluttered search interface. Interestingly, UA cloaking is the only way to heal this sneaky behavior.

“Don’t create pages with malicious behavior.” Google’s guilty, too. Instead of checking for the user’s browser and redirecting only IE6 requests from Google’s discontinued IE6 support (IE6 toolbar …) to the IE8 advertisement, whilst all other user agents get their desired search box, respectively their SERPs, under a google.com/search?output=ie&… URI, Google performs an unconditional redirect to a page that’s utterly useless and also totally unexpected for many searchers. I consider misleading redirects malicious.

“Avoid links to web spammers or ‘bad neighborhoods’ on the web.” I consider the propaganda for IE that Google displays instead of the search results I’d expect a bad neighborhood on the Web, because IE constantly ignores Web standards, forcing developers and designers to implement superfluous workarounds. (Ok, ok, ok … Google’s lack of geekiness doesn’t exactly count as a violation of their webmaster guidelines, but it sounds good, doesn’t it?)
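
For the record, the conditional redirect described two paragraphs up (only IE6 gets the pitch, everyone else keeps the search interface) is a handful of lines. A sketch:

<?php
// Sketch of the conditional redirect Google skipped: only IE6 gets the
// IE8 pitch, everybody else keeps the plain /ie search interface.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($ua, 'MSIE 6.') !== false) {
    header('Location: http://www.google.com/toolbar/ie8/sidebar.html', true, 302);
    exit;
}
// ... otherwise serve the minimalistic search box / SERP as before.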

Hey Matt Cutts, about time to ban google.com/ie! Click to tweet that

Google’s very best search interface is history. Here is what you got under http://www.google.com/ie?num=100&hl=en&safe=off&q=minimalistic:

[Screenshot: Google's famous minimalistic search UI]

And here is where Google sneakily redirects you when you load the SERP link above (even with Chrome!): http://www.google.com/toolbar/ie8/sidebar.html

[Screenshot: Google's sneaky IE8 propaganda]

It’s sad that a browser vendor like Google (and yes, Google Chrome is my favorite browser) feels the need to mislead its users with propaganda for a competing browser that’s slower and doesn’t render everything as it should be rendered. But when this particular browser vendor also leads Web search, and makes use of black hat techniques that it bans webmasters for, then that’s a scandal. So, if you agree, please submit a spam report to Google:

Hey Matt Cutts, about time to ban google.com/ie! #spam-report Tweet Your Spam Report

2010-05-17 I’ve updated this pamphlet because it didn’t explain the “sneakiness” clearly enough. As of today, the unconditional redirect is still sneaky IMHO. Google needs to deliver searchers their desired search results, and serve the ads for a somewhat better browser only to stubborn IE6 users.

2010-05-18 Q: You’re pissed solely because your SERP scraping scripts broke. A: Glad you’ve asked. Yes, I’ve scraped Google’s /ie search too. Not because I’m a privacy nazi like Daniel Brandt. I’ve just checked (my) rankings. However, when I spotted the redirects I didn’t even remember the location of the scripts that scraped this service, because I haven’t looked at ranking reports in years. I’m interested in actual traffic, and revenues. Ego food annoys me. I just love the /ie search interface. So the answer is a bold “no”. I don’t give a fucking dead rat’s ass what ranking reports based on scraped SERPs could tell me.




Get yourself a smart robots.txt

Crawlers and other Web robots are the plague of today’s InterWebs. Some bots like search engine crawlers behave (IOW respect the Robots Exclusion Protocol - REP), others don’t. Behaving or not, most bots just steal your content. You don’t appreciate that, so block them.

This pamphlet is about blocking behaving bots with a smart robots.txt file. I’ll show you how you can restrict crawling to bots operated by major search engines –that bring you nice traffic– while keeping the nasty (or useless, traffic-wise) bots out of the game.

The basic idea is that blocking all bots –with very few exceptions– makes more sense than maintaining a kind of Web robots who’s who in your robots.txt file. You decide whether a bot, respectively the service it crawls for, does you any good, or not. If a crawler like Googlebot or Slurp needs access to your content to generate free targeted (search engine) traffic, put it on your whitelist. All the remaining bots will run into a bold Disallow: /.

Of course that’s not exactly the popular way to handle crawlers. The standard is a robots.txt that allows all crawlers to steal your content, restricting just a few exceptions, or no robots.txt at all (weak, very weak). That’s bullshit. You can’t handle a gazillion bots with a black list.

Even bots that respect the REP can harm your search engine rankings, or reveal sensitive information to your competitors. Every minute a new bot turns up. You can’t manage all of them, and you can’t trust any (behaving) bot. Or, as the master of bot control explains: “That’s the only thing I’m concerned with: what do I get in return. If it’s nothing, it’s blocked“.

Also, large robots.txt files handling tons of bots are fault prone. It’s easy to fuck up a complete robots.txt with a simple syntax error in one user agent section. If you on the other hand verify legit crawlers and output only instructions aimed at the Web robot actually requesting your robots.txt, plus a fallback section that blocks everything else, debugging robots.txt becomes a breeze, and you don’t enlighten your competitors.

If you’re a smart webmaster agreeing with this approach, here’s your ToDo-List:
• Grab the code
• Install
• Customize
• Test
• Implement.
On error read further.

The anatomy of a smart robots.txt

Everything below goes for Web sites hosted on Apache with PHP installed. If you suffer from something else, you’re somewhat fucked. The code isn’t elegant. I’ve tried to keep it easy to understand even for noobs — at the expense of occasional lengthiness and redundancy.

Install

First of all, you should train Apache to parse your robots.txt file for PHP. You can do this by configuring all .txt files as PHP scripts, but that’s kinda cumbersome when you serve other plain text files with a .txt extension from your server, because you’d have to add a leading <?php ?> string to all of them. Hence you add this code snippet to your root’s .htaccess file:
<FilesMatch ^robots\.txt$>
SetHandler application/x-httpd-php
</FilesMatch>

As long as you’re testing and customizing my script, make that ^smart_robots\.txt$.

Next grab the code and extract it into your document root directory. Do not rename /smart_robots.txt to /robots.txt until you’ve customized the PHP code!

For testing purposes you can use the logRequest() function. Probably it’s a good idea to CHMOD /smart_robots_log.txt 0777 then. Don’t leave that in a production system, better log accesses to /robots.txt in your database. The same goes for the blockIp() function, which in fact is a dummy.

Customize

Search the code for #EDIT and edit it accordingly. /smart_robots.txt is the robots.txt file, /smart_robots_inc.php defines some variables as well as functions that detect Googlebot, MSNbot, and Slurp. To add a crawler, you need to write a isSomecrawler() function in /smart_robots_inc.php, and a piece of code that outputs the robots.txt statements for this crawler in /smart_robots.txt, respectively /robots.txt once you’ve launched your smart robots.txt.

Let’s look at /smart_robots.txt. First of all, it sets the canonical server name, change that to yours. After routing robots.txt request logging to a flat file (change that to a database table!) it includes /smart_robots_inc.php.

Next it sends some HTTP headers that you shouldn’t change. I mean, when you hide the robots.txt statements served only to authenticated search engine crawlers from your competitors, it doesn’t make sense to allow search engines to display a cached copy of their exclusive robots.txt right from their SERPs.
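
Boiled down, those headers amount to something like this (cf. the sample responses further down):

<?php
// Headers sent before any robots.txt content: plain text, and keep the
// robots.txt itself out of indexes, caches, and snippets.
header('Content-Type: text/plain; charset=iso-8859-1');
header('X-Robots-Tag: noindex, noarchive, nosnippet');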

As a side note: if you want to know what your competitor really shoves into their robots.txt, just link to it, wait for indexing, and view its cached copy. To test your own robots.txt with Googlebot, you can log in to GWC and fetch it as Googlebot. It’s a shame that the other search engines don’t provide a feature like that.

When you implement the whitelisted crawler method, you really should provide a contact page for crawling requests. So please change the “In order to gain permissions to crawl blocked site areas…” comment.

Next up are the search engine specific crawler directives. You put them as
if (isGooglebot()) {
$content .= "
User-agent: Googlebot
Disallow:

\n\n";
}

If your URIs contain double quotes, escape them as \" in your crawler directives. (The function isGooglebot() is located in /smart_robots_inc.php.)

Please note that you need to output at least one empty line before each User-agent: section. Repeat that for each accepted crawler, before you output
$content .= "User-agent: *
Disallow: /
\n\n";

Every behaving Web robot that’s not whitelisted will bounce at the Disallow: /.

Before $content is sent to the user agent, rogue bots receive their well deserved 403-GetTheFuckOuttaHere HTTP response header. Rogue bots include SEOs surfing with a Googlebot user agent name, as well as all SEO tools that spoof the user agent. Make sure that you do not output a single byte –for example leading whitespaces, a debug message, or a #comment– before the print $content; statement.

Blocking rogue bots is important. If you discover a rogue bot –for example a scraper that pretends to be Googlebot– during a robots.txt request, make sure that anybody coming from its IP with the same user agent string can’t access your content!

Bear in mind that each and every piece of content served from your site should implement rogue bot detection; that’s doable even with non-HTML resources like images or PDFs.
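
A sketch of such a check, reusable in front of any resource. It assumes the isGooglebot() and blockIp() functions from /smart_robots_inc.php, and that blockIp() takes the offending IP (adjust to whatever signature you end up with):

<?php
// Sketch: deny spoofed crawlers on any resource, not just robots.txt.
// isGooglebot() (full host name verification) and blockIp() come from
// /smart_robots_inc.php as discussed here.
require 'smart_robots_inc.php';

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false && !isGooglebot()) {
    blockIp($_SERVER['REMOTE_ADDR']);   // remember this IP/UA combo
    header('HTTP/1.1 403 Forbidden');
    exit;
}
// ... serve the image, PDF, or page as usual.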

Finally we deliver the user agent specific robots.txt and terminate the connection.

Now let’s look at /smart_robots_inc.php. Don’t fuck-up the variable definitions and routines that populate them or deal with the requestor’s IP addy.

Customize the functions blockIp() and logRequest(). blockIp() should populate a database table of IPs that will never see your content, and logRequest() should store bot requests (not only of robots.txt) in your database, too. Speaking of bot IPs, most probably you want to get access to a feed serving search engine crawler IPs that’s maintained 24/7 and updated every 6 hours: here you go (don’t use it for deceptive cloaking, promised?).
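
One possible shape for those two functions, assuming PDO/MySQL and hypothetical table and column names (blocked_ips, bot_requests); the signatures are assumptions, not the script’s originals:

<?php
// Sketch only: the $db handle, table and column names are assumptions.
function blockIp($ip) {
    global $db; // a PDO connection set up elsewhere
    $stmt = $db->prepare('INSERT IGNORE INTO blocked_ips (ip, blocked_at) VALUES (?, NOW())');
    $stmt->execute(array($ip));
}

function logRequest($uri, $ip, $userAgent, $httpStatus) {
    global $db;
    $stmt = $db->prepare('INSERT INTO bot_requests (uri, ip, user_agent, http_status, requested_at) VALUES (?, ?, ?, ?, NOW())');
    $stmt->execute(array($uri, $ip, $userAgent, $httpStatus));
}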

/smart_robots_inc.php comes with functions that detect Googlebot, MSNbot, and Slurp.

Most search engines tell you how you can verify their crawlers and which crawler directives their user agents support. To add a crawler, just adapt my code. For example, to add Yandex, test the host name for a leading “spider” and a trailing “.yandex.ru” string with an integer in between, like in the isSlurp() function.
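
Following that recipe, an isYandexbot() for /smart_robots_inc.php could look roughly like this (untested, modeled on the description above; check Yandex’s own docs for the exact host patterns):

<?php
// Sketch of isYandexbot(): reverse DNS, host name pattern, forward confirmation.
function isYandexbot() {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);                    // e.g. spider42.yandex.ru
    if (!preg_match('/^spider\d+\.yandex\.ru$/i', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;           // forward lookup must match
}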

Test

Develop your stuff in /smart_robots.txt, test it with a browser and by monitoring the access log (file). With Googlebot you don’t need to wait for crawler visits, you can use the “Fetch as Googlebot” thingy in your webmaster console.

Define a regular test procedure for your production system, too. Closely monitor your raw logs for changes the search engines apply to their crawling behavior. It could happen that Bing sends out a crawler from “.search.live.com” by accident, or that someone at Yahoo starts an ancient test bot that still uses an “inktomisearch.com” host name.

Don’t rely on my crawler detection routines. They’re dumped from memory in a hurry, I’ve tested only isGooglebot(). My code is meant as just a rough outline of the concept. It’s up to you to make it smart.

Launch

Rename /smart_robots.txt to /robots.txt replacing your static /robots.txt file. Done.

The output of a smart robots.txt

When you download a smart robots.txt with your browser, wget, or any other tool that comes with user agent spoofing, you’ll see a 403 or something like:


HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain;charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: *
Disallow: /
(the contact form URI above doesn’t exist)

whilst a real search engine crawler like Googlebot gets slightly different contents:


HTTP/1.1 200 OK
Date: Wed, 24 Feb 2010 16:14:50 GMT
Server: AOL WebSrv/0.87 beta (Unix) at 127.0.0.1
X-Powered-By: sebastians-pamphlets.com
X-Robots-Tag: noindex, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain; charset=iso-8859-1

# In order to gain permissions to crawl blocked site areas
# please contact the webmaster via
# http://sebastians-pamphlets.com/contact/webmaster/?inquiry=cadging-bot

User-agent: Googlebot
Allow: /
Disallow:

Sitemap: http://sebastians-pamphlets.com/sitemap.xml

User-agent: *
Disallow: /

Search engines hide important information from webmasters

Unfortunately, most search engines don’t provide enough information about their crawling. For example, last time I looked Google doesn’t even mention the Googlebot-News user agent in their help files, nor do they list all their user agent strings. Check your raw logs for “Googlebot-” and you’ll find tons of Googlebot-Mobile crawlers with various user agent strings. For proper content delivery based on reliable user agent detection webmasters do need such information.

I’ve nudged Google and their response was that they don’t plan to update their crawler info pages in the foreseeable future. Sad. As for the other search engines, check their webmaster information pages and judge for yourself. Also sad. A not exactly remote search engine didn’t even properly announce that they’d changed their crawler host names a while ago. Very sad. A search engine changing its crawler host names breaks code on many websites.

Since search engines don’t cooperate with webmasters, go check your log files for all the information you need to steer their crawling, and to deliver the right contents to each spider fetching your contents “on behalf of” particular user agents.

 

Enjoy.

 

Changelog:

2010-03-02: Fixed a reporting issue. 403-GTFOH responses to rogue bots were logged as 200-OK. Scanning the robots.txt access log /smart_robots_log.txt for 403s now provides a list of IPs and user agents that must not see anything of your content.


