Spam

Archived posts from the 'Spam' Category

How to spam the hell out of Google’s new source attribution meta elements

Posted on 16 November, 2010

The moment you’ve read Google’s announcement and Matt’s question “What about spam?” you concluded “spamming it is a breeze”, right? You’re not alone.

Before we discuss how to abuse it, it might be a good idea to define it within its context, ok?

Playground

First of all, Google announced these meta tags on the official Google News blog for a reason. So if you plan to abuse them with your countless MFA proxies of Yahoo Answers, you’ve most probably jumped on the wrong bandwagon. Google supports the meta elements below in Google News only.

syndication-source

The first new indexer hint is syndication-source. It’s meant to tell Google the permalink of a particular news story, hence the author and all the folks spreading the word are asked to use it to point to the one -and only one- URI considered the source:

<meta name="syndication-source" content="http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html" />

The meta element above is for instances of the story served from
http://outerspace.com/breaking/page1.html
http://outerspace.com/yyyy-mm-dd/page2.html
http://outerspace.com/news/aliens-appreciate-google-hotpot.html
http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html
http://newspaper.com/main/breaking.html
http://tabloid.tv/rehashed/from/rss/hot:alien-pot-in-your-bong.html
…

Don’t confuse it with the cross-domain rel-canonical link element. It’s not about canning duplicate content; it marks a particular story, regardless of whether it’s somewhat rewritten or just reprinted with a different headline. It tells Google News to use the original URI when the story can be crawled from different URIs on the author’s server, and when syndicated stories on other servers are so similar to the initial piece that Google News prefers to use the original (the latter is my educated guess).

original-source

The second new indexer hint is original-source. It’s meant to tell Google the origin of the news itself, so the author/enterprise digging it out of the mud, as well as all the folks using it later on, are asked to declare who broke the story:

<meta name="original-source" content="http://outerspace.com/news/ubercool-geeks-launched-google-hotpot.html" />

Say we’ve got two or more related news stories, like “Google fell from Mars” by cnn.com and “Google landed in Mountain View” by sfgate.com. It makes sense for latimes.com to publish a piece like “Google fell from Mars and landed in Mountain View”. Because latimes.com is a serious newspaper, they credit their sources not only with a mention or even embedded links; they do it in machine-readable form, too:

<meta name="original-source" content="http://cnn.com/google-fell-from-mars.html" />
<meta name="original-source" content="http://sfgate.com/google-landed-in-mountain-view.html" />

It’s a matter of course that both cnn.com and sfgate.com provide such an original-source meta element on their pages, in addition to the syndication-source meta element, both pointing to their very own coverage.
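To make the two hints a bit more tangible, here is a minimal sketch of how a consumer of such pages might read them out of the markup. It uses only Python's standard html.parser module. The names SourceHintParser and read_source_hints are made up for this illustration, and nothing here claims to reflect how Google News actually processes the tags.

```python
from html.parser import HTMLParser

# The two meta names come from Google's announcement; the rest is illustrative.
HINT_NAMES = ("syndication-source", "original-source")


class SourceHintParser(HTMLParser):
    """Collects the content values of source attribution meta elements."""

    def __init__(self):
        super().__init__()
        self.hints = {name: [] for name in HINT_NAMES}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> elements end up here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in self.hints and content:
            self.hints[name].append(content)


def read_source_hints(html):
    """Return a dict mapping each hint name to the list of URIs declared."""
    parser = SourceHintParser()
    parser.feed(html)
    return parser.hints
```

Fed with the latimes.com example above, read_source_hints would list both the cnn.com and the sfgate.com URI under original-source, because the element may appear more than once per page.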

If a journalist grabbed his breaking news from a secondary source telling “CNN reported five minutes ago that Google’s mothership started from Venus, and the LA Times spotted it crashing on Jupiter”, he can’t be bothered with looking at the markup and locating those meta elements in the head section, he has a deadline for his piece “Why Web search left Planet Earth”. It’s just fine with Google News when he puts

<meta name="original-source" content="http://cnn.com/" />
<meta name="original-source" content="http://sfgate.com/" />

Fine print

As always, the most interesting stuff is hidden on a help page:

At this time, Google News will not make any changes to article ranking based on this tags.

If we detect that a site is using these metatags inaccurately (e.g., only to promote their own content), we’ll reduce the importance we assign to their metatags. And, as always, we reserve the right to remove a site from Google News if, for example, we determine it to be spammy.

As with any other publisher-supplied metadata, we will be taking steps to ensure the integrity and reliability of this information.

It’s a field test

We think it is a promising method for detecting originality among a diverse set of news articles, but we won’t know for sure until we’ve seen a lot of data. By releasing this tag, we’re asking publishers to participate in an experiment that we hope will improve Google News and, ultimately, online journalism. […] Eventually, if we believe they prove useful, these tags will be incorporated among the many other signals that go into ranking and grouping articles in Google News. For now, syndication-source will only be used to distinguish among groups of duplicate identical articles, while original-source is only being studied and will not factor into ranking. [emphasis mine]

Spam potential

Well, we do know that Google Web search has a spam problem, IOW even a few so-1999-webspam-tactics still work to some extent. So we tend to classify a vague threat like “If we find sites abusing these tags, we may […] remove [those] from Google News entirely” as FUD, and spam away. Common sense and experience tell us that a smart marketer will make money from everything spammable.

But: we’re not talking about Web search. Google News is a clearly laid out environment. There are only so many sites covered by Google News. Even if Google weren’t able to develop algos analyzing all source attribution attributes out there, they do have the resources to identify abuse using manpower alone. Most probably they will do both.

They clearly told us that they will compare those meta data to other signals. And that’s not only very weak indicators like “timestamp first crawled” or “first heard of via pubsubhubbub”. It’s not that hard to isolate particular news, gather each occurrence as well as source mentions within, and arrange those on a time line with clickable links for QC folks who most certainly will identify the actual source. Even a few spot tests daily will soon reveal the sites whose source attribution meta tags are questionable, or even spammy.

If you’re still not convinced, fair enough. Go spam away. Once you’ve lost your entry on the whitelist, your free traffic from Google News, as well as from news-one-box results on conventional SERPs, is toast.

Last but not least, a fair warning

Now, if you still want to use source attribution meta elements on your non-newsworthy MFA sites to claim ownership of your scraped content, feel free to do so. Most probably Matt’s team will appreciate just another “I’m spamming Google” signal.

Not that reprinting scraped content is considered shady any more: even a former president does it shamelessly. It’s just the almighty Google in all of its evilness that penalizes you for considering all on-line content public domain.

8 comments Sebastian | Search Quality, Testing, Webspam, Spam, Plagiarism, Google

The anatomy of a deceptive Tweet spamming Google Real-Time Search

Posted on 10 December, 2009

Minutes after the launch of Google’s famous Real Time Search, the Internet marketing community began to spam the scrolling SERPs. Google gave birth to a new spam industry.

I’m sure Google’s WebSpam team will pull the plug sooner or later, but as of today Google’s real time search results are extremely vulnerable to questionable content.

The somewhat shady approach to make creative use of real time search I’m outlining below will not work forever. It can be used for really evil purposes, and Google is aware of the problem. Frankly, if I were the Googler in charge, I’d dump the whole real-time thingy until the spam defense lines are rock solid.

Here’s the recipe from Dr Evil’s WebSpam-Cook-Book:

Ingredients

1 popular topic that pulls lots of searches, but not so many that the results scroll down too fast.
1 landing page that makes the punter pull out the plastic in no time.
1 trusted authority page totally lacking commercial intentions. View its source code, it must have a valid TITLE element with an appealing call for action related to your topic in its HEAD section.
1 short domain, 1 cheap Web hosting plan (Apache, PHP), 1 plain text editor, 1 FTP client, 1 Twitter account, and a pinch of basic coding skills.

Preparation

Create a new text file and name it hot-topic.php or so. Then code:

<?php
$landingPageUri = "http://affiliate-program.com/?your-aff-id";
$trustedPageUri = "http://google.com/something.py";
if (stristr($_SERVER["HTTP_USER_AGENT"], "Googlebot")) {
    header("HTTP/1.1 307 Here you go today", TRUE, 307);
    header("Location: $trustedPageUri");
} else {
    header("HTTP/1.1 301 Happy shopping", TRUE, 301);
    header("Location: $landingPageUri");
}
exit;
?>

Provided you’re a savvy spammer, your crawler detection routine will be a little more complex.

Save the file and upload it, then test the URI http://youspamaw.ay/hot-topic.php in your browser.

Serving

Log in to Twitter and submit lots of nicely crafted, not overly keyword-stuffed messages carrying your spammy URI. Do not use obscene language, e.g. don’t swear, and sail around phrases like ‘buy cheap viagra’ with synonyms like ‘brighten up your girlfriend’s romantic moments’.
On their SERPs, Google will display the text from the trusted page’s TITLE element, linked to your URI that leads punters to a sales pitch of your choice.
Just for entertainment, closely monitor Google’s real time SERPs, and your real-time sales stats as well.
Be happy and get rich by end of the week.

Google removes links to untrusted destinations, that’s why you need to abuse authority pages. As long as you don’t launch f-bombs, Google’s profanity filters make flooding their real time SERPs with all sorts of crap a breeze.

Hey Google, for the sake of our children, take that as a spam report!

20 comments Sebastian | Webspam, Search Quality, Redirects, Internet Marketing, Spam, Twitter, SEO, Cloaking, Crap, Google

Hard facts about URI spam

Posted on 1 December, 2009

I stole this pamphlet’s title (and more) from Google’s post Hard facts about comment spam for a reason. In fact, Google spams the Web with useless clutter, too. You doubt it? Read on. That’s the URI from the link above:

http://googlewebmastercentral.blogspot.com/2009/11/hard-facts-about-comment-spam.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FamDG+%28Official+Google+Webmaster+Central+Blog%29

The canonical URI is everything before the question mark; everything after it is clutter added by Google.

When your Google account lists both Feedburner and GoogleAnalytics as active services, Google will automatically screw your URIs when somebody clicks a link to your site in a feed reader (you can opt out, see below).

Why is it bad?

FACT: Google’s method to track traffic from feeds to URIs creates new URIs. And lots of them. Depending on the number of possible values for each query string variable (utm_source, utm_medium, utm_campaign, utm_content, utm_term), the number of cluttered URIs pointing to the same piece of content can add up to dozens or more.

FACT: Bloggers (publishers, authors, anybody) naturally copy those cluttered URIs to paste them into their posts. The same goes for user link drops at Twitter and elsewhere. These links get crawled and indexed. Currently Google’s search index is flooded with 28,900,000 cluttered URIs mostly originating from copy+paste links. Bing and Yahoo haven’t indexed GA tracking parameters yet.

That’s 29 million URIs with tracking variables that point to duplicate content as of today. With every link copied from a feed reader, this number will increase. Matt Cutts said “I don’t think utm will cause dupe issues” and points to John Müller’s helpful advice (methods a site owner can apply to tidy up Google’s mess).

Maybe Google can handle this growing duplicate content chaos in their very own search index. Let’s forget that Google is the search engine that advocated URI canonicalization for ages, invented sitemaps, rel=canonical, and countless highly sophisticated algos to merge indexed clutter under the canonical URI. It’s all water under the bridge now that Google is in the create-multiple-URIs-pointing-to-the-same-piece-of-content business itself.

So far that’s just disappointing. To understand why it’s downright evil, let’s look at the implications from a technical point of view.

Spamming URIs with utm tracking variables breaks lots of things

Look at this URI: http://www.example.com/search.aspx?Query=musical+mobile?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies

Google added a query string to a query string. Two URI segment delimiters (“?”) can cause all sorts of trouble at the landing page.

Some scripts will process only the variables from Google’s query string, because they extract GET input from the URI’s last question mark to the fragment delimiter “#” or the end of the URI; some scripts expecting input variables in a particular sequence will be confused at least; some scripts might even use the same variable names … the number of possible errors caused by amateurishly extended query strings is infinite. Even if there’s only one “?” delimiter in the URI.
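A quick sketch shows what the doubled delimiter does to the example URI. It uses Python's standard urllib.parse; the variable names first_split and last_split are made up for this illustration, and your own server-side stack may of course behave differently.

```python
from urllib.parse import parse_qs, urlsplit

uri = ("http://www.example.com/search.aspx?Query=musical+mobile"
       "?utm_source=Referral&utm_medium=Internet&utm_campaign=celebritybabies")

# Standard parsing splits at the FIRST "?", so the second "?" and the
# utm_source pair leak into the value of the Query variable.
first_split = parse_qs(urlsplit(uri).query)

# A script that reads its input from the LAST "?" never sees Query at all.
last_split = parse_qs(uri.rsplit("?", 1)[1])
```

With the first approach the search term becomes “musical mobile?utm_source=Referral” and utm_source goes missing; with the second one the site search receives no Query variable. Either way the landing page gets something its author never intended.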

In some cases the page the user gets faced with will lack the expected content, or will display a prominent error message like 404, or will consist of white space only because the underlying script failed so badly that the Web server couldn’t even show a 5xx error.

Regardless of whether a landing page can handle query string parameters added to the original URI or not (most can), changing someone’s URI for tracking purposes is plain evil, IMHO, when implemented as opt-out instead of opt-in.

Appended UTM query strings can make trackbacks vanish, too. When a blog checks whether the trackback URI is carrying a link to the blog or not, for example with this plug-in, the comparison can fail and the trackback gets deleted on arrival, without notice. If I dug a little deeper, most probably I could compile a huge list of other functionalities on the Internet that are broken by Google’s UTM clutter.

Finally, GoogleAnalytics is not the one and only stats tool out there, and it doesn’t fulfil all needs. Many webmasters rely on simple server reports, for example referrer stats or tools like awstats, for various technical purposes. Broken. Specialized content management tools fed by real-time traffic data. Broken. Countless tools for linkpop analysis group inbound links by landing page URI. Broken. URI canonicalization routines. Broken, or rather now counterproductive with regard to GA reporting. Google’s UTM clutter has an impact on lots of tools that make sense in addition to Google Analytics. All broken.

What a glorious mess. Frankly, I’m somewhat puzzled. Google has hired tens of thousands of this planet’s brightest minds -I really mean that, literally!-, and they came out with half-assed crap like that? Un-fucking-believable.

What can I do to avoid URI spam on my site?

Boycott Google’s poor man’s approach to link feed traffic data to Web analytics. Go to Feedburner. For each of your feeds click on “Configure stats” and uncheck “Track clicks as a traffic source in Google Analytics”. Done. Wait for a suitable solution.

If you really can’t live with traffic sources gathered from a somewhat unreliable HTTP_REFERER, and you’ve deep pockets, then hire a WebDev crew to revamp all your affected code. Coward!

As a matter of fact, Google is responsible for this royal pain in the ass. Don’t fix Google’s errors on your site. Let Google do the fault recovery. They own the root of all UTM evil, so they have to fix it. There’s absolutely no reason why a gazillion webmasters and developers should do Google’s job, again and again.

What can Google do?

Well, that’s quite simple. Instead of adding utterly useless crap to URIs found in feeds, Google can make use of a clever redirect script. When Feedburner serves feed items to anybody, the values of all GA tracking variables are available.

Instead of adding clutter to these URIs, Feedburner could replace them with a script URI that stores the timestamp, the user’s IP addy, and whatnot, then performs a 301 redirect to the canonical URI. The GA script invoked on the landing page can access and process these data quite accurately.
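To illustrate the idea, here is a rough sketch of such a redirect handler. Everything in it is an assumption made for the sake of the example: the token scheme, the in-memory FEED_LINKS and CLICK_LOG stores, and the function name handle_feed_click. How the stored data would reach the analytics script on the landing page is left open.

```python
import time

# Hypothetical stores; a real service would use a database instead.
FEED_LINKS = {"a1b2c3": "http://www.example.com/blog/nice-canonical-uri.html"}
CLICK_LOG = []


def handle_feed_click(token, client_ip, campaign):
    """Record the click, then answer with a 301 to the untouched canonical URI."""
    canonical_uri = FEED_LINKS.get(token)
    if canonical_uri is None:
        return 404, {}
    # The tracking data stays on the server instead of polluting the URI.
    CLICK_LOG.append({
        "token": token,
        "ip": client_ip,
        "campaign": campaign,
        "timestamp": time.time(),
    })
    return 301, {"Location": canonical_uri}
```

The point of the design is that the visitor's address bar, and therefore every copy+paste link, only ever shows the canonical URI.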

Perhaps this procedure would be even more accurate, because link drops can no longer mimic feed traffic.

Speak out!

So, if you don’t approve that Feedburner, GoogleReader, AdSense4Feeds, and GoogleAnalytics gang rape your well designed URIs, then link out to everything Google with a descriptive query string, like:

?utm_source=sebastian&utm_medium=pamphlet&utm_campaign=thou+shalt+not+fuck+with+my+uris

I mean, nicely designed canonical URIs should be the search engineer’s porn, so perhaps somebody at Google will listen. Will ya?

Update:

I’ve just added a “UTM Killer” tool, where you can enter a screwed URI and get a clean URI — all ‘utm_’ crap and multiple ‘?’ delimiters removed — in return. That’ll help when you copy URIs from your feedreader to use them in your blog posts.
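For those who prefer to clean URIs in their own scripts, the following is a minimal sketch of the same idea, not the code behind the tool. It relies on Python's standard urllib.parse. The function name strip_utm is made up, and treating every additional “?” as a variable separator is an assumption that would mangle a URI with a legitimate question mark inside a value.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_utm(uri):
    """Return the URI without utm_* variables and without stray "?" delimiters."""
    parts = urlsplit(uri)
    # Assumption: any "?" after the first one was appended together with
    # tracking clutter, so it is handled like an "&" separator.
    pairs = parts.query.replace("?", "&").split("&")
    kept = [pair for pair in pairs
            if pair and not pair.lower().startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       "&".join(kept), parts.fragment))
```

Applied to the search.aspx example from above, strip_utm returns the URI with only the Query variable left; applied to a URI that carries nothing but utm_ variables, it returns the bare canonical URI without a trailing question mark.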

By the way, please vote up this pamphlet so that I get the 2010 SEMMY Award. Thanks in advance!

19 comments Sebastian | Search Quality, Duplicate Content, Analytics, Internet Marketing, Webspam, Spam, SEO, Crap, Copy+Paste-Penalties, AdSense, Google

Handle your (UGC) feeds with care!

Posted on 20 November, 2009

When you run a website that deals with user generated content (UGC), this pamphlet is for you. #bad_news Otherwise you might enjoy it. #malicious_joy

Not that the recent -and ongoing- paradigm shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I’ll blather about pitfalls, leaving chances unmentioned.

Background

For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.

More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling http://twitter.com/your-user-name or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.

Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.

Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That’s way cheaper, and faster, than polling a gazillion sources for updates 24/7/365.

Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.

Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).

Traditional discovery crawling, indexing, and ranking doesn’t exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion computers to this task. Also, submission based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML-sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.

Nothing is like it used to be. Change happens.

Why does this paradigm shift put your site at risk?

As a matter of fact, when you publish user generated content, you will get spammed. Of course, that’s yesterday’s bad news. Probably you’re confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot it’s spam that slipped through your filters.

Bad news is, your medieval palisade won’t protect you from a 21st century tank attack with air support. Why not? Because you’ve secured your HTML presentation layer, but not your feeds. There’s no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.

Feed items come with a high risk. Once they’re released, they’re immortal and multiply themselves like rabbits. That’s bad enough in case a pissed employee ‘accidentally’ publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, and be it only for a few milliseconds.

If your content management system (CMS) creates a feed item on submission, search engines -in good company with legions of Web services- will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.

Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed reader flooded with PPC spam), and search engines as well (you send them weird spam signals that raise all sorts of red flags). Also, it’s not desirable to make social media services -that you rely on for marketing purposes- too suspicious (trigger happy anti-spam algos might lock away your site’s base URI in an escape-proof dungeon).

So what can you do to prevent your feeds from unwanted content?

Before I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might not have considered yet:

No nay never use integers as IDs before you’re dead sure that a piece of submitted content is floral white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren’t predictable and therefore not that vulnerable. Once a content submission is finally approved, you can give it a nice -maybe even meaningful- URI.
No nay never use titles, subjects or so in URIs, not even converted text from submissions (e.g. ‘My PPC spam’ ==> ‘my_ppc_spam’). Why not? See above. And you don’t really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can’t fetch, or which they can’t refetch, at least for a while.
Before the final approval, serve submitted content with a “noindex,nofollow,noarchive,nosnippet” X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section. Don’t rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide that they should crawl the link’s destination anyway.
Consider serving social media bots requesting a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don’t obey REP directives, that means they fetch and process your stuff regardless of whether you yell “noindex” at them or not.
For all burned (disapproved) URIs that were in use, ensure that your server returns a 410-Gone HTTP status code, or perform a 301 redirect to a policy page or so to rescue link love that would get wasted otherwise.
Your Web forms for content submissions should be totally AJAX’ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk-throughs, for example by asking for something unusual when a step gets completed in too short a period of time. With AJAX calls that’s painless for the legit user. Do not accept content submissions via standard GET or POST requests.
Serve link builders coming from SERPs for [URL|story|link submit|submission your site’s topic] etc. your policy page, not the actual Web form.
There’s more. With the above said, I’ve just begun to scratch the surface of a savvy spammer’s technical portfolio. There’s next to nothing a cleverly programmed bot can’t mimic. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.
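A few of the points above (UUID identifiers, the X-Robots-Tag header, 503 for social media bots, 410 for burned URIs) can be sketched in a handful of lines. This is a framework-agnostic illustration under stated assumptions: the state names, the bot markers, the function names, and the Retry-After value are all invented here and need to be replaced with whatever fits your CMS and your raw logs.

```python
import uuid

PENDING_DIRECTIVES = "noindex,nofollow,noarchive,nosnippet"

# Hypothetical user agent fragments; compile the real list from your logs.
SOCIAL_BOT_MARKERS = ("examplesocialbot", "sharepreviewfetcher")


def new_submission_id():
    """Unpredictable identifier for the URI of a not yet approved submission."""
    return str(uuid.uuid4())


def respond(state, user_agent):
    """Return (status, headers) for a submission in the given state.

    state is "pending", "approved", or "burned"; these names are assumptions.
    """
    if state == "burned":
        return 410, {}
    if state == "pending":
        agent = user_agent.lower()
        if any(marker in agent for marker in SOCIAL_BOT_MARKERS):
            # Retry-After is my addition, it is not part of the list above.
            return 503, {"Retry-After": "86400"}
        return 200, {"X-Robots-Tag": PENDING_DIRECTIVES}
    return 200, {}
```

The corresponding robots meta element in the HEAD section and the alternative 301 to a policy page are left out to keep the sketch short.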

Having said that, let’s proceed to feed protection tactics. Actually, there’s just one principle set in stone:

Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!

The interesting question is: what the heck is a “final approval”? Well, that depends on your needs. Ideally, that’s you releasing each and every piece of submitted content. Since this approach doesn’t scale, think of ways to semi-automate the process. Don’t fully automate it, there’s no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.

Each and every content submission must survive a probation period during which it will not be included in your site’s feeds, regardless of who contributed it. Stick with the four-eyes principle. Here are a few generic procedures you could adapt, or ideas that could inspire you:

Queue submissions. Possible queues are Blocked, Quarantine, Suspect, Probation, and finally Released. Define simple rules and procedures that anyone involved can follow. SOPs lack workarounds and loopholes by design.
Stick content submissions from new users who didn’t participate in other ways in quarantine. Moderate this queue and only manually release into the probation queue what passes the moderator’s heuristics. Signup-submit-and-forget is a typical spammer behavior.
Maintain black lists of domain names, IPs, countries, user agent names, unwanted buzzwords and so on. Use filters to arrest submissions that contain keywords you wouldn’t expect to match your site’s theme in the Blocked or Quarantine queue.
On submission fetch the link’s content and analyze it, don’t stick with heuristic checks of URIs, titles and descriptions. Don’t use methods like PHP’s file_get_contents that don’t return HTTP response codes. You need to know whether a requested URI is the first one of a redirect chain, for example. Double check with a second request from another IP, preferably owned by a widely used ISP, with a standard browser’s user agent string, that provides an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission’s suggested anchor text. If the returned content differs too much, set a red flag.
Maintain white lists, too. That’s a great way to reduce the amount of unavoidable false positives.
If you have editorial staff or moderators, they should get a Release to Feed button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like “release to feed if mod-release = true and num-trusted-votes > 10”.
Categorize your users’ reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.
Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.
With all automated releases, for example based on votings, oops, especially based on votings, implement at least one additional sanity check. For example discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votings, and so on.
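The queue idea and the example release rule could look roughly like the sketch below. The Submission class, its field names, and the functions are hypothetical; only the queue names and the “mod-release = true and num-trusted-votes > 10” rule are taken from the list above. Counting trusted votes (and discarding the untrusted ones) is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field

QUEUES = ("Blocked", "Quarantine", "Suspect", "Probation", "Released")
MIN_TRUSTED_VOTES = 10  # threshold from the example rule in the list above


@dataclass
class Submission:
    queue: str = "Quarantine"   # new submissions start in quarantine
    mod_release: bool = False   # set by the moderator's Release to Feed button
    trusted_votes: int = 0      # votes from trusted users only
    flags: list = field(default_factory=list)


def may_release_to_feed(item):
    """Mod release plus more than ten trusted votes, and no raised flag."""
    if item.queue != "Probation" or item.flags:
        return False
    return item.mod_release and item.trusted_votes > MIN_TRUSTED_VOTES


def release(item):
    """Move the item to Released if the rule allows it; return its queue."""
    if may_release_to_feed(item):
        item.queue = "Released"
    return item.queue
```

Only items in the Released queue would then be handed to the feed and sitemap generators; anything flagged stays where it is until a human has looked at it.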

Disclaimer: That’s just some food for thought. I want to make absolutely clear that I can’t provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.

3 comments Sebastian | Web development, Webspam, Spam

As if sloppy social media users ain’t bad enough … search engines support traffic theft

Posted on 26 October, 2009

Prepare for a dose of techy tin foil hattery. Again, I’m going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it’s URI, not URL).

Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do/will steal your traffic and your Google juice, mislead and piss off your potential ~~visitors~~ customers, and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don’t own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I’m not willing to reveal (yet). For example I won’t explain sneaky URI hijacking or how I stole a portion of tinyurl.com’s search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I’ve gamed has closed this particular loophole now. There are still way too many playgrounds for deceptive tactics involving shortened URIs …

How should a search engine handle a shortened URI?

Handling a URI as a shortened URL requires a bulletproof method to detect shortened URIs. That’s a breeze.

Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads on the top frame, and I’ve even seen a few of them making use of meta refreshes and client-side redirects. Search engines can detect all those procedures.
Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
Seed: trusted sources like LongURL.org provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like su.pr …).
Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
URI length: the length of a URI string, if less than or equal to 20 characters, is an indicator at most, because some URI shortening services offer keyword rich short URIs, and many sites provide natural URIs this short.
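A few of the signals listed above can be combined in a small sketch. This is only an illustration, not how any search engine actually does it: the seed set `KNOWN_SHORTENER_HOSTS` and the function name `shortener_signals` are made up for this example, and a real seed list would come from a maintained source.

```python
from urllib.parse import urlsplit

# Hypothetical seed list for illustration only.
KNOWN_SHORTENER_HOSTS = {"tinyurl.com", "su.pr"}
# Redirect codes mentioned in the text above.
REDIRECT_CODES = {301, 302, 307}


def shortener_signals(uri, status_code, location=None):
    """Return the names of the signals suggesting `uri` is a shortened URI."""
    signals = []
    host = (urlsplit(uri).hostname or "").lower()

    # Seed: the host is on a list of known shortening services.
    if host in KNOWN_SHORTENER_HOSTS:
        signals.append("seed")

    # Redirect pattern: the stopover redirects to a different host.
    if status_code in REDIRECT_CODES and location:
        dest_host = (urlsplit(location).hostname or "").lower()
        if dest_host and dest_host != host:
            signals.append("cross-domain-redirect")

    # URI length: a weak hint at most, natural URIs can be this short too.
    if len(uri) <= 20:
        signals.append("short-length")

    return signals
```

Note that a perfectly natural URI like `http://example.com/a` triggers the length signal alone, which is why length should never count for much on its own.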

Search engine crawlers bouncing at short URIs should do a lookup, following the complete chain of redirects. (Some wacky services shorten everything that looks like a URI, even shortened URIs, or do a lookup themselves replacing the original short URI with another short URI that they can track. Yup, that’s some crazy insanity.)
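Following the complete chain could look roughly like the sketch below. To keep it testable without network access, the HTTP request is passed in as a `fetch` callable returning `(status_code, location)`; that callable, the name `resolve_chain` and the `max_hops` limit are assumptions of this example.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_chain(uri, fetch, max_hops=10):
    """Follow redirects starting at `uri`.

    `fetch(uri)` must return a tuple (status_code, location_or_None).
    Returns (stopovers, destination), where stopovers is the list of
    redirecting URIs in the order they were visited.
    """
    stopovers = []
    seen = set()
    current = uri
    while True:
        # Guard against shorteners that point at each other in circles.
        if current in seen or len(stopovers) >= max_hops:
            raise ValueError("redirect loop or chain too long: " + uri)
        seen.add(current)

        status, location = fetch(current)
        if status in REDIRECT_CODES and location:
            stopovers.append(current)
            # The Location header may be relative to the current URI.
            current = urljoin(current, location)
        else:
            return stopovers, current
```

Every entry in `stopovers` would then be treated as an alias of the returned destination.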

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn’t gain reputation (PageRank™, anchor text, or whatever), regardless of the method of redirection. All the link juice belongs to the destination page.
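The two rules from the paragraph above, sketched as toy code. The function names and the plain substring match on the query are simplifications invented for this example; a real engine would normalize URIs and queries far more carefully.

```python
from collections import Counter


def credit_inbound_links(linked_uris, alias_of):
    """Count inbound links per page, crediting each alias to its destination.

    `alias_of` maps a shortened URI to the destination it resolves to.
    """
    credit = Counter()
    for uri in linked_uris:
        # A stopover gains nothing; its destination gets the link.
        credit[alias_of.get(uri, uri)] += 1
    return credit


def alias_visible(query, short_uri, destination_uri):
    """An alias may show up only if the query names it or its destination."""
    q = query.lower()
    return short_uri.lower() in q or destination_uri.lower() in q
```

With this rule, a bare [site:tinyurl.com] query never surfaces the alias, while a query containing the short URI itself does.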

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Oh what a mess. The candidate from Redmond fails totally at understanding the HTTP protocol. Their search index is flooded with a bazillion URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. Also, you’ll find URIs that do a permanent redirect and have nothing to do with URI shortening in their index, too.

I can’t be bothered with checking what Bing does in response to other redirects, since the 301 test fails so badly. Clicking on their first results for [site:tinyurl.com], I’ve noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI-SERPs, please learn more under Yahoo and Google.

Yahoo does an awesome job, with a tiny exception

Yahoo has done a better job. They index short URIs and show the destination page, at least via their site explorer. When I search for a tinyURL, the SERP link points to the URI shortener; that could be improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short-URIs totally right (I will not elaborate on this issue, so please don’t ask for detailed information if you’re not a SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo’s bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there’s room for improvements

As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don’t provide a link to the destination page. Although that’s the expected behavior (those URIs aren’t linked on the crawled page), that’s sad. At least Google didn’t fail on the 301 test.

As for the somewhat evil tactics I’ve applied in my tests so far, Google fell in love with some abusive short-URIs. Google -under particular circumstances- indexes shortened URIs that game Googlebot, and has sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I’ve begun to deploy sneaky sURLs, Google greatly improved their spam filters, but they’re not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

A long time ago Tim Berners-Lee told you that ~~URI shorteners are evil~~ fucking with URIs is a very bad habit. Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped in a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted “borrowed” SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that’s not any search engine’s problem, but search engines could help us get rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It’s up to you. Go stop it. As long as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive these utterly useless scumbags of their livelihood.

Unfortunately, as a father and as a webmaster, I don’t believe in common sense applied by social media services. Hence, I see a “Twitter actively bypasses safe-search filters tricking my children into viewing hardcore porn” post coming. Dear Twitter & Co. (and that addresses all services that make use of or transport shortened URIs), put an end to shortened URIs. Now!

6 comments Sebastian | Social Web, MSN, URI shortening, Search Quality, Spam, Yahoo, Risky Linkage, Twitter, Google

Opting out: mailto://me is history

Posted on 27 May, 2009

Today I’ve removed all instances of the thunderbird icon from my computers, and from my memory as well. I’m finally done with email. I’ve forwarded¹⁾ all my email accounts to [email protected], and here’s why:

Sebastian’s Pamphlets

Dear Sebastian,

I visited your web site earlier today and it seems you are also a seo company like us. As an SEO company we are in this field since 1998 in India(CHD). We have developed and maintained high quality websites.

We understand link building better than other because of our 11 year experience in linking industry and we follows the right manual link building approach in seeking, obtaining and attracting topic specific trusted inbound links. We have different themes related sites, directories and blogs and i would like to make a request to enter a mutual understanding by EXCHANGING LINKS with your website in order to get targeted visitors, higher ranking and link popularity.

We look forward to linking our site with yours, as exchanging links would Benefit both of us.

You’ve received this email simply because you have been found while searching for related sites in Google, MSN and Yahoo If you do not wish to receive future emails, simply reply with this email and let us know.

Waiting for your positive and quick response.

PLEASE NOTE THAT THIS IS NOT A SPAM OR AUTOMATED EMAIL, IT’S ONLY A REQUEST FOR A LINK EXCHANGE. YOUR EMAIL ADDRESS HAS NOT BEEN ADDED TO ANY LISTS, AND YOU WILL NOT BE CONTACTED AGAIN.

Regards:
Lara

Lara
Megrisoft
[email protected]

Direct message from Spamdiggalot

Hi, Sebastian.

You have a new direct message:

Spamdiggalot: hi!I think you should like my article “12 addons to get the most out of safer-sex”, here: digg.com/x010101 please RT!

Reply on the web at http://twitter.com/direct_messages/create/Spamdiggalot

Send me a direct message from your phone: D SPAMDIGGALOT

our company proposal

Dear Sebastian Pamphlets,

My name is Vincentas and I am member of board in multi-location hosting company - Host1Plus (http:// www . host1plus . com). Our servers are in U.S., U.K., the Netherlands, Germany, Lithuania and Singapore.

I just visited your website which I found interested and it provides excellent complementary content.
We would like to offer you free hosting for your site in Host1Plus hosting service the only thing we would ask you is to place our visitors counter to your website here is the link http:// www . count1plus . com or it could be any other feature.

So let me know if you are interested for my offer and I hope that offer is interested to you. Hope to hear you soon.

Kind Regards,
Vincentas Grinius

Host1Plus.com Team
part of Digital Energy Technologies Ltd.
26 York Street
London

W1U 6PZ
United Kingdom
T: +44 (0) 808 101 2277
E: [email protected]
W: http:// www . host1plus . com

Vincentas Grinius
Host1Plus.com
[email protected]

Link Exchange

Hi,

I think if I receive something like this I would pay more attention to that.
"Dear Webmaster I am so happy to find your website and I like it so much! So I want to be a link partner of your site.

If you are interested to make us your link partner , please inform us and we will be glad to make our link partner within 24 hours.

Our Link Details :

Title: Social Network Development UK

URL: http:// www . dassnagar . co . uk/

Description: Web Development Company UK: Premier Interactive Agency, specializing in custom website design, Social network development, Sports betting portal development, Travel portal design, Flash gaming portal design and development.

Link’s HTML Code:

<a href="http:// www . dassnagar . co . uk/" target="new">Social Network Development UK
</a> Web Development Company UK: Premier Interactive Agency, specializing in custom website design, Social network development, Sports betting portal development, Travel portal design, Flash gaming portal design and development.

Please accept my apology if already partner or not interested.

Reasons to exchange link with us.

1. Our site is regularly crawled by google, so there are better chances googlebot visiting your website regularly.
2. We ask you to link back to only those pages where your url is present, indirectly you are increasing your own link value.
3. By linking to our articles and technology blog you can provide useful content to your visitors.

This is an advertisement and a promotional mail strictly on the guidelines of CAN-SPAM act of 2003 . We have clearly mentioned the source mail-id of this mail, also clearly mentioned the subject lines and they are in no way misleading in any form. We have found your mail address through our own efforts on the web search and not through any illegal way. If you find this mail unsolicited, please reply with "Unsubscribe" in the subject line and we will take care that you do not receive any further promotional mail.

Please feel free to contact me if you have any questions.

Kind regards,
Tom
Webmaster

John
dassnagar . co . uk
[email protected]

Trust me, quitting email is a time-saver. And yes, I’ve an idea how to waste the additional spare time: Tomorrow I’ll have paid me a beer for a link to myself. And I can think of way more link monkey business that doesn’t involve email.

¹⁾ Actually, “forwarding” comes with a slightly shady downside:
If you continue to send me your (unsolicited) emails, you’ll find all your awkward secrets on literally tons of automatically generated Web pages -nicely plastered with very targeted ads and usually x-rated or otherwise NSFW banners-, hosted on throw-away domains.
I’m such a devil.

2 comments Sebastian | Spam, Link Building, Internet Marketing, Reciprocal Links, Fun, Paid Links, Spam Report, Risky Linkage, Crap

Dealing with spamming content thieves / plagiarists (oylinki.com)

Posted on 21 January, 2008

When it comes to crap like plagiarism you shouldn’t consider me a gentleman.

If assclowns like Veronica Domb steal my content and publish it along with likewise stolen comments on their blatantly spamming site oylinki.com, I’m somewhat upset.

Then when I leave a polite note asking the thief Veronica Domb from EmeryVille to remove my stuff asap, see my comment marked as “in moderation”, but neither my content gets removed nor my comment is published within 24 hours, I stay annoyed.

When I’m annoyed, I write blog posts like this one. I’m sure it will rank high enough for [Veronica Domb] when the assclown’s banker or taxman searches for her name. I’m sure it’ll be visible on any SERP that any other (potential) business partner submits at a major search engine.

Content Thieves Veronica Domb et al, P.O.BOX 99800, EmeryVille, 94662, CA are blatant spammers

Hey, outing content thieves is way more fun than filing boring DMCA complaints, and way more effective. Plagiarists do ego searches too, and from now on Veronica Domb from EmeryVille will find the footsteps of her criminal activities on the Web with each and every ego search. Isn’t that nice?

Not. Of course Veronica Domb is a pseudonym of Slade Kitchens, Jamil Akhtar, … However, some plagiarists and scam artists aren’t smart enough to hide their identity, so watch out.

Maybe I’ve done some companies a little favor, because they certainly don’t need to send out money sneakily “earned” with Web spam and criminal activities that violate the TOS of most affiliate programs.

AdBrite will love to cancel the account for these affiliate links: http://ads.adbrite.com/mb/text_group.php?sid=448245&br=1 &dk=736d616c6c20627573696e6573735f355f315f776562 http://www.adbrite.com/mb/commerce/purchase_form.php?opid=448245&afsid=1

Google’s webspam team as well as other search engines will most likely delist oylinki.com that comes with 100% stolen text and links and faked whois info as well.

Spamcop and alike will happily blacklist oylinki.com (IP: 66.199.174.80 , cwh2.canadianwebhosting.com) because the assclown’s blog software sends out email spam masked as trackbacks.

If anybody is interested, here’s a track of the real “Veronica Domb” from Canada clicking the link to this post from her WP admin panel: 74.14.107.36 - - [21/Jan/2008:07:50:40 -0500] "GET /outing-plagiarist-2008-01-21/ HTTP/1.1" 200 9921 "http://oylinki.com/blog/wp-admin/edit-comments.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SU 3.005; .NET CLR 1.1.4322; InfoPath.1; Alexa Toolbar; .NET CLR 2.0.50727)"

Common sense is not as common as you think.

Disclaimer: I’ve outed plagiarists in the past, because it works. Whether you do that on ego-SERPs or not depends on your ethics. Some folks think that’s even worse than theft and spamming. I say that publishing plagiarisms in the first place deserves bad publicity.

11 comments Sebastian | Webspam, Blogging, Spam, Copyrights, Plagiarism, Crap

MSN spam to continue says the Live Search Blog

Posted on 5 December, 2007

It seems MSN/LiveSearch has tweaked their rogue bots and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and news.

Since August 2007 MSN runs a bogus bot faking a human visitor coming from a search results page, that follows their crawler. This spambot downloads everything from a page, that is images and other objects, external CSS/JS files, and ad blocks rendering even contextual advertising from Google and Yahoo. It fakes MSN SERP referrers diluting the search term stats with generic and unrelated keywords. Webmasters running non-adult sites wondered why a database tutorial suddenly ranks for [oral sex] and why MSN sends visitors searching for [MILF pix] to a teenager’s diary. Webmasters assumed that MSN is after deceitful cloaking, and laughed out loud because their webspam detection method was that primitive and easy to fool.

Now MSN admits all their sins -except the launch of a porn affiliate program- and posted a vague excuse on their Webmaster Blog telling the world that they discovered the evil cloakers and their index is somewhat spam free now. Donna has chatted with the MSN spam team about their spambot and reports that blocking its IP addresses is a bad idea, even for sites that don’t cloak. Vanessa Fox summarized MSN’s poor man’s cloaking detection at Search Engine Land:

And one has to wonder how effective methods like this really are. Those savvy enough to cloak may be able to cloak for this new cloaker detection bot as well.

They say that they no longer spam sites that don’t cloak, but reverse this statement telling Donna

we need to be able to identify the legitimate and illegitimate content

and Vanessa

sites that are cloaking may continue to see some amount of traffic from this bot. This tool crawls sites throughout the web — both those that cloak and those that don’t — but those not found to be cloaking won’t continue to see traffic.

Here is an excerpt from yesterday’s referrer log of a site that does not cloak, and never did:

http://search.live.com/results.aspx?q=webmaster&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=smart&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=search&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=progress&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=domain&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=database&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=content&mrt=en-us&FORM=LIVSOP
http://search.live.com/results.aspx?q=business&mrt=en-us&FORM=LIVSOP

Why can’t the MSN dudes tell the truth, not even when they apologize?
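If you want to see which keywords the bot plants in your own stats, the referrers above can be tallied with a few lines of Python. This is just a sketch: the function name `live_search_terms` is made up, and it assumes you have already pulled the referrer strings out of your log.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def live_search_terms(referrers):
    """Count the q= search terms found in search.live.com referrers."""
    terms = Counter()
    for ref in referrers:
        parts = urlsplit(ref)
        # Ignore referrers from anywhere else.
        if parts.hostname != "search.live.com":
            continue
        for term in parse_qs(parts.query).get("q", []):
            terms[term] += 1
    return terms
```

Calling `live_search_terms(referrers).most_common()` lists the planted keywords by frequency.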

Another lie is “we obey robots.txt”. Of course the spambot doesn’t request it to bypass bot traps, but according to MSN it uses a copy served to the LiveSearch crawler “msnbot”:

Yes, this robot does follow the robots.txt file. The reason you don’t see it download it, is that we use a fresh copy from our index. The tool does respect the robots.txt the same way that MSNBot does with a caveat; the tool behaves like a browser and some files that a crawler would ignore will be viewed just like real user would.

In reality, it doesn’t help to block CSS/JS files or images in robots.txt, because MSN’s spambot will download them anyway. The long-winded statement above translates to “We promise to obey robots.txt, but if it fits our needs we’ll ignore it”.

Well, MSN is not the only search engine running stealthy bots to detect cloaking, but they aren’t clever enough to do it in a less abusive and detectable way.

Their insane spambot led all cloaking specialists out there to their not-so-obvious spam detection methods. They may have caught a few cloaking sites, but considering the short life cycle of Webspam on throwaway domains they shot themselves in both feet. What they really have achieved is that the cloaking scripts are now immune to MSN’s spam detection.

Was it really necessary to annoy and defraud the whole Webmaster community and to burn huge amounts of bandwidth just to catch a few cloakers who launched new scripts on new throwaway domains hours after the first appearance of the MSN spam bot?

Can cosmetic changes with regard to their useless spam activities restore MSN’s lost reputation? I doubt it. They’ve admitted their miserable failure five months too late. Instead of dumping the spambot, they announce that they’ll spam away for the foreseeable future. How silly is that? I thought Microsoft is somewhat profit orientated, why do they burn their and our money with such amateurish projects?

Besides all this crap MSN has good news too. Microsoft Live Search told Search Engine Roundtable that they’ll spam our sites with keywords related to our content from now on, at least they’ll try it. And they have a forum and a contact form to gather complaints. Crap on, so much bureaucratic effort to administer their ridiculous spam fighting funeral. They’d better build a search engine that actually sends human traffic.

9 comments Sebastian | Spoofing, MSN, Search Quality, Webspam, Crap, Spam, Cloaking

Microsoft funding bankrupt Live Search experiment with porn spam

Posted on 16 November, 2007

If only this headline would be linkbait … of course it’s not sarcastic.

Rumors are out that Microsoft will launch a porn affiliate program soon. The top secret code name for this project is “pornbucks”, but analysts say that it will be launched as “M$ SMUT CASH” next year or so.

Since Microsoft just can’t ship anything in time, and the usual delays aren’t communicated internally, their search dept. began to promote it to Webmasters this summer.

Surprisingly, Webmasters across the globe weren’t that excited to find promotional messages from Live Search in their log files, so a somewhat confused MSN dude posted a lame excuse to a large Webmaster forum.

Meanwhile we found out that Microsoft Live Search does not only target the adult entertainment industry, they’re testing the waters with other money terms like travel or pharmaceutical products too.

Anytime soon the Live Search menu bar will be updated to something like this:

Here is the sad -but true- story of a search engine’s downfall.

A few months ago Microsoft Live Search discovered that x-rated referrer spam is a must-have technique in a sneaky smut peddler’s marketing toolbox.

Since August 2007 a bogus Web robot follows Microsoft’s search engine crawler “MSNbot” to spam the referrer logs of all Web sites out there with URLs pointing to MSN search result pages featuring porn.

Read your referrer logs and you’ll find spam from Microsoft too, but perhaps they peeve you with viagra spam, offer you unwanted but cheap payday loans, or try to enlarge your penis. Of course they know every trick in the book on spam, so check for harmless catchwords too. Here is an example URL: http://search.live.com/results.aspx?q= spammy-keyword &mrt=en-us&FORM=LIVSOP

Microsoft’s spam bot not only leaves bogus URLs in log files, hoping that Webmasters will click them on their referrer stats pages and maybe sign up for something like “M$ Porn Bucks” or so. It even downloads and renders adverts powered by their rival Google, lowering their CTR; obviously to make programs like AdSense less attractive in comparison with Microsoft’s own ads (sorry, no link love from here).

Let’s look at Microsoft’s misleading statement:

The traffic you are seeing is part of a quality check we run on selected pages. While we work on addressing your conerns, we would request that you do not actively block the IP addreses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index.

That’s not traffic, that’s bot activity: These hits come within seconds of being indexed by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it’s genuine) and within a few seconds, the very same page is requested with a live.com search result URL as referer by the MSN spam bot faking a human visitor.
If that’s really a quality check to detect cloaking, that’s more than just lame. The IP addresses don’t change, the bogus bot uses a static user agent name, and there are other footprints which allow every cloaking script out there to serve this sneaky bot the exact same spider fodder that MSNbot got seconds before. This flawed technique might catch poor man’s cloaking every once in a while, but it can’t fool savvy search marketers.
The FUD “could prevent your site from being included in the Live Search index” is laughable, because in most niches MSN search traffic is nonexistent.
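The request pattern described above (MSNBot fetches a page, then the same page is requested seconds later with a live.com SERP referrer) is easy to spot in your own logs. Here is a minimal sketch, assuming the log has already been parsed into `(timestamp, path, user_agent, referrer)` tuples sorted by time. The 10 second window, the `msnbot` substring match and the function name are assumptions of this example, not documented facts about the bot.

```python
LIVE_SERP_PREFIX = "http://search.live.com/results.aspx"


def find_followup_hits(hits, window_seconds=10):
    """Return hits with a live.com referrer that closely follow an MSNBot crawl.

    `hits` is an iterable of (timestamp, path, user_agent, referrer) tuples,
    sorted by timestamp (seconds). Assumes MSNBot identifies itself with
    "msnbot" in its user agent string.
    """
    last_crawl = {}  # path -> timestamp of the latest MSNBot request
    suspicious = []
    for ts, path, agent, referrer in hits:
        if "msnbot" in agent.lower():
            last_crawl[path] = ts
        elif referrer.startswith(LIVE_SERP_PREFIX):
            crawled = last_crawl.get(path)
            # Same page, requested shortly after the crawler saw it.
            if crawled is not None and 0 <= ts - crawled <= window_seconds:
                suspicious.append((ts, path, referrer))
    return suspicious
```

A human visitor arriving from a real live.com search hours after the crawl falls outside the window and is not flagged.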

All major search engines, including MSN, promise that they obey the robots exclusion standard. Obeying robots.txt is the holy grail of search engine crawling. A search engine that ignores robots.txt and other standardized crawler directives cannot be trusted. The crappy MSN bot doesn’t even bother to read robots.txt, so there’s no chance to block it with standardized methods. Only IP blocking can keep it out, but then it still seems to download ads from Google’s AdSense servers by executing the JavaScript code that the MSN crawler gathered before (not obeying Google’s AdSense robots.txt as well).

This unethical spam bot downloading all images, external CSS and JS files, and whatnot also burns bandwidth. That’s plain theft.

Since this method cannot detect (most) cloaking, and the so called “search quality control bot” doesn’t stop visiting sites which obviously do not cloak, it is a sneaky marketing tool. Whether or not Microsoft Live Search tries to promote cyberspace porn and on-line viagra shops plays no role. Even spamming with safe-at-work keywords is evil. Do these assclowns really believe that such unethical activities will increase the usage of their tiny and pretty unpopular search engine? Of course they do, otherwise they would have shut down the spam bot months ago.

Dear reader, please tell me: what do you think of a search engine that steals (bandwidth and AdSense revenue), lies, spams away, and is not clever enough to stop their criminal activities when they’re caught?

Recently a Live Search rep whined in an interview because so many robots.txt files out there block their crawler:

One thing that we noticed for example while mining our logs is that there are still a fair number of sites that specifically only allow Googlebot and do not allow MSNBot.

There’s a suitable answer, though. Update your robots.txt:

User-agent: MSNbot
Disallow: /
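To double check what the robots.txt rule above does for a well-behaved crawler, Python’s standard `urllib.robotparser` can evaluate it offline. The helper name `is_blocked` and the example.com URL are placeholders for this sketch.

```python
from urllib.robotparser import RobotFileParser

# The rule from the text, one directive per line.
ROBOTS_LINES = [
    "User-agent: MSNbot",
    "Disallow: /",
]


def is_blocked(robots_lines, user_agent, url):
    """True if a crawler obeying these robots.txt lines must not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(user_agent, url)
```

`is_blocked(ROBOTS_LINES, "MSNbot", "http://example.com/page.html")` is true, while the same call for "Googlebot" is false. Of course this only tells you what an obedient crawler would do.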

10 comments Sebastian | Internet Marketing, MSN, Search Quality, Spam, Crawler Directives, Crap

Gaming Sphinn is not worth it

Posted on 3 November, 2007

OMFG, yet another post on Sphinn? Yup. I’ll tell you why gaming Sphinn is counterproductive, because I just don’t want to read another whiny rant along the lines of “why do you ignore my stuff whilst A listers [whatever this undefined term means] get their crap sphunn hot in no time”. Also, discussions assuming that success equals bad behavior like this or this one are neither funny nor useful. As for the whiners: Grow the fuck up and produce outstanding content, then network politely but unobtrusively to promote it. As for the gamers: Think before you ruin your reputation!

What motivates a wannabe Internet marketer to game Sphinn?

Traffic of course, but that’s a myth. Sphinn sends very targeted traffic but also very few visitors (see my stats below).

Free uncondomized links. Ok, that works, one can gain enough link love to get a page indexed by the search engines, but for this purpose it’s not necessary to push the submission to the home page.

Attention is up next. Yep, Sphinn is an eldorado for attention whores, but not everybody is an experienced high-class call girl. Most are amateurs giving it a (first) try, or wrecked hookers pushing too hard to attract positive attention.

The keyword is positive attention. Sphinners are smart, they know every trick in the book. Many of them make a living with ~~gaming~~ creative use of social media. Cheating professional gamblers is a waste of time, and will not produce positive attention. Even worse, the shit sticks at the handle of the unsuccessful cheater (and in many cases the real name). So if you want to burn your reputation, go found a voting club to feed your crap.

Fortunately, getting caught for artificial voting at Sphinn comes with devalued links too. The submitted stories are taken off the list, that means no single link at Sphinn (besides profile pages) feeds them any more, hence search engines forget them. Instead of a good link from an unpopular submission you get zilch when you try to cheat your way to the popular links pages.

Although Sphinn doesn’t send shitloads of traffic, this traffic is extremely valuable. Many sphinners operate or control blogs and tend to link to outstanding articles they found at Sphinn. Many sphinners have accounts on other SM sites too, and bookmark/cross-submit good content. It’s not unusual that 10 visits from Sphinn result in hundreds or even thousands of hits from StumbleUpon & Co., but sphinners don’t bookmark/blog/cross-submit/stumble crap.

So either write great content and play by the rules, or get nowhere with your crappy submission. The first “10 reasons why 10 tricks posts about 10 great tips to write 10 numbered lists” submission was fun. The 10,000 plagiarisms following were just boring noise. Nobody except your buddies or vote bots sphinn crap like that, so don’t bother to provide the community with footprints of your lousy gaming.

If you’re playing number games, here is why ruining a reputation by gaming Sphinn is not worth it. Look at my visitor stats from July to today. I got 3.6k referrers in 4 months from Sphinn because a few of my posts went hot. When a post sticks with 1-5 votes, you won’t attract much more click throughs than from those 1-5 folks who sphunn it (that would give 100-200 hits or so with the same amount of submissions). When you cheat, the story gets buried and you get nothing but flames. Think about that. Thanks.

Rank	Last Date/Time	Referral Site	Count
1	Oct 09, 2007 @ 23:29	http://sphinn.com/story/1622	504
2	Oct 23, 2007 @ 14:53	http://sphinn.com/story/2764	419
3	Nov 01, 2007 @ 03:42	http://sphinn.com	293
4	Oct 08, 2007 @ 04:21	http://sphinn.com/story/5469	288
5	Nov 02, 2007 @ 13:35	http://sphinn.com/story/8883	192
6	Oct 09, 2007 @ 23:38	http://sphinn.com/story/4335	185
7	Oct 22, 2007 @ 23:55	http://sphinn.com/story/5362	139
8	Oct 29, 2007 @ 15:02	http://sphinn.com/upcoming	131
9	Nov 02, 2007 @ 13:34	http://sphinn.com/story/7170	131
10	Sep 10, 2007 @ 09:09	http://sphinn.com/story/1976	116
11	Oct 15, 2007 @ 22:40	http://sphinn.com/story/6122	113
12	Sep 22, 2007 @ 13:39	http://sphinn.com/story/3593	90
13	Oct 05, 2007 @ 21:56	http://sphinn.com/story/5648	87
14	Sep 22, 2007 @ 13:25	http://sphinn.com/story/4072	80
15	Oct 14, 2007 @ 17:24	http://sphinn.com/story/5973	77
16	Aug 30, 2007 @ 04:17	http://sphinn.com/story/1796	72
17	Oct 16, 2007 @ 05:46	http://sphinn.com/story/6761	61
18	Oct 11, 2007 @ 05:56	http://sphinn.com/story/1447	60
19	Sep 13, 2007 @ 12:27	http://sphinn.com/story/4548	54
20	Nov 02, 2007 @ 22:14	http://sphinn.com/story/11547	53
21	Sep 03, 2007 @ 09:34	http://sphinn.com/story/4068	44
22	Oct 09, 2007 @ 23:40	http://sphinn.com/story/5093	42
23	Nov 02, 2007 @ 01:46	http://sphinn.com/story/248	41
24	Sep 14, 2007 @ 05:58	http://sphinn.com/story/2287	36
25	Oct 31, 2007 @ 06:17	http://sphinn.com/story/11205	35
26	Oct 07, 2007 @ 12:07	http://sphinn.com/story/6124	25
27	Nov 01, 2007 @ 09:41	http://sphinn.com/user/view/profile/Sebastian	22
28	Aug 08, 2007 @ 10:52	http://sphinn.com/story/245	21
29	Sep 02, 2007 @ 19:17	http://sphinn.com/story/3877	17
30	Sep 22, 2007 @ 00:42	http://sphinn.com/story/4968	17
31	Oct 01, 2007 @ 12:49	http://sphinn.com/story/5310	17
32	Aug 30, 2007 @ 08:20	http://sphinn.com/story/4143	14
33	Sep 11, 2007 @ 21:38	http://sphinn.com/story/3783	13
34	Nov 01, 2007 @ 15:50	http://sphinn.com/published/page/2	11
35	Sep 01, 2007 @ 23:03	http://sphinn.com/story/597	10
36	Oct 24, 2007 @ 18:17	http://sphinn.com/story/1767	10
37	Sep 15, 2007 @ 08:26	http://sphinn.com/story.php?id=5469	8
38	Oct 30, 2007 @ 09:42	http://sphinn.com/upcoming/mostpopular	7
39	Oct 24, 2007 @ 18:38	http://sphinn.com/story/10881	7
40	Oct 30, 2007 @ 01:19	http://sphinn.com/upcoming/page/2	6
41	Sep 20, 2007 @ 07:09	http://sphinn.com/user/view/profile/login/Sebastian	5
42	Jul 22, 2007 @ 09:39	http://sphinn.com/story/1017	5
43	Oct 13, 2007 @ 08:34	http://sphinn.com/published/week	5
44	Sep 08, 2007 @ 04:17	http://sphinn.com/story/4653	5
45	Oct 31, 2007 @ 06:55	http://sphinn.com/story/11614	5
46	Aug 13, 2007 @ 03:06	http://sphinn.com/story/2764/editcomment/4018	4
47	Aug 23, 2007 @ 07:52	http://sphinn.com/story.php?id=3593	4
48	Sep 20, 2007 @ 06:21	http://sphinn.com/published/page/1	4
49	Oct 23, 2007 @ 15:01	http://sphinn.com/story/748	3
50	Jul 29, 2007 @ 10:47	http://sphinn.com/story/title/Google-launched-a-free-ranking-checker	3
51	Sep 30, 2007 @ 21:13	http://sphinn.com/category/Google/parent_name/Google	3
52	Aug 25, 2007 @ 04:47	http://sphinn.com/story.php?id=3735	3
53	Sep 15, 2007 @ 11:28	http://sphinn.com/story.php?id=5648	3
54	Sep 29, 2007 @ 01:35	http://sphinn.com/story/7058	3
55	Oct 28, 2007 @ 22:56	http://sphinn.com/greatesthits	3
56	Oct 23, 2007 @ 04:44	http://sphinn.com/story/10380	3
57	Oct 27, 2007 @ 04:10	http://sphinn.com/story/11233	3
58	Jul 13, 2007 @ 04:23	Google Search: http://sphinn.com	2
59	Jul 21, 2007 @ 03:19	http://sphinn.com/story.php?id=849	2
60	Jul 27, 2007 @ 10:06	http://sphinn.com/story.php?id=1447	2
61	Jul 30, 2007 @ 20:09	http://sphinn.com/story.php?id=1796	2
62	Aug 07, 2007 @ 10:01	http://sphinn.com/published/page/3	2
63	Aug 13, 2007 @ 11:20	http://sphinn.com/story.php?id=2764	2
64	Sep 05, 2007 @ 05:23	http://sphinn.com/story/3735	2
65	Aug 28, 2007 @ 01:56	http://sphinn.com/story.php?id=3877	2
66	Aug 27, 2007 @ 10:01	http://sphinn.com/submit.php?url=http://sebastians-pamphlets.com/links/categories	2
67	Aug 31, 2007 @ 14:13	http://sphinn.com/story.php?id=4335	2
68	Sep 02, 2007 @ 14:29	http://sphinn.com/story.php?id=1622	2
69	Sep 08, 2007 @ 19:48	http://sphinn.com/story.php?id=4548	2
70	Sep 05, 2007 @ 01:07	http://sphinn.com/submit.php?url=http://sebastians-pamphlets.com/why-ebay-and-wikipedia-rule-googles-serps	2
71	Sep 06, 2007 @ 13:22	http://sphinn.com/published/page/4	2
72	Sep 16, 2007 @ 13:30	http://sphinn.com/story.php?id=3783	2
73	Sep 18, 2007 @ 11:55	http://sphinn.com/story.php?id=5973	2
74	Sep 19, 2007 @ 08:15	http://sphinn.com/story.php?id=6122	2
75	Sep 19, 2007 @ 14:37	http://sphinn.com/story.php?id=6124	2
76	Oct 23, 2007 @ 00:07	http://sphinn.com/story/10387	2
77	Jul 16, 2007 @ 18:21	http://sphinn.com/upcoming/category/AllCategories/parent_name/All Categories	1
78	Jul 19, 2007 @ 20:19	http://sphinn.com/story/864	1
79	Jul 20, 2007 @ 15:57	http://sphinn.com/story/title/Buy-Viagra-from-Reddit	1
80	Jul 27, 2007 @ 10:48	http://sphinn.com/story/title/Blogger-to-rule-search-engine-visibility	1
81	Jul 31, 2007 @ 06:07	http://sphinn.com/story/title/The-Unavailable-After-tag-is-totally-and-utterly-useless	1
82	Aug 02, 2007 @ 14:45	http://sphinn.com/user/view/history/login/Sebastian	1
83	Aug 03, 2007 @ 10:59	http://sphinn.com/story.php?id=1976	1
84	Aug 06, 2007 @ 03:59	http://sphinn.com/user/view/commented/login/Sebastian	1
85	Aug 15, 2007 @ 08:27	http://sphinn.com/category/LinkBuilding	1
86	Aug 15, 2007 @ 14:17	http://sphinn.com/story/2764/editcomment/4362	1
87	Aug 28, 2007 @ 13:42	http://sphinn.com/story/849	1
88	Sep 09, 2007 @ 15:15	http://sphinn.com/user/view/commented/login/flyingrose	1
89	Sep 10, 2007 @ 05:15	http://sphinn.com/published/page/20	1
90	Sep 10, 2007 @ 05:55	http://sphinn.com/published/page/19	1
91	Sep 11, 2007 @ 12:22	http://sphinn.com/published/page/8	1
92	Sep 11, 2007 @ 23:13	http://sphinn.com/category/Blogging	1
93	Sep 12, 2007 @ 09:04	http://sphinn.com/story.php?id=5362	1
94	Sep 13, 2007 @ 06:36	http://sphinn.com/category/GoogleSEO/parent_name/Google	1
95	Sep 14, 2007 @ 08:21	http://hwww.sphinn.com	1
96	Sep 16, 2007 @ 14:52	http://sphinn.com/GoogleSEO/Did-Matt-Cutts-by-accident-reveal-a-sure-fire-procedure-to-identify-supplemental-results	1
97	Sep 18, 2007 @ 08:05	http://sphinn.com/story/5721	1
98	Sep 18, 2007 @ 09:08	http://sphinn.com/story/title/If-yoursquore-not-an-Amway-millionaire-avoid-BlogRush-like-the-plague	1
99	Sep 18, 2007 @ 10:02	http://sphinn.com/story/5973#wholecomment8559	1
100	Sep 19, 2007 @ 11:48	http://sphinn.com/user/view/voted/login/bhancock	1
101	Sep 19, 2007 @ 20:27	http://sphinn.com/published/page/5	1
102	Sep 20, 2007 @ 00:39	http://blogmarks.net/my/marks,new?title=How to get the perfect logo for your blog&url=http://sebastians-pamphlets.com/how-to-get-the-perfect-logo-for-your-blog/&summary=&via=http://sphinn.com/story/6122	1
103	Sep 20, 2007 @ 01:34	http://sphinn.com/user/page/3/voted/Wiep	1
104	Sep 24, 2007 @ 15:49	http://sphinn.com/greatesthits/page/3	1
105	Sep 24, 2007 @ 19:51	http://sphinn.com/story.php?id=6761	1
106	Sep 24, 2007 @ 22:32	http://sphinn.com/greatesthits/page/2	1
107	Sep 26, 2007 @ 15:13	http://sphinn.com/story.php?id=7170	1
108	Sep 29, 2007 @ 05:27	http://sphinn.com/category/SphinnZone	1
109	Oct 09, 2007 @ 11:44	http://sphinn.com/story.php?id=8883	1
110	Oct 10, 2007 @ 10:04	http://sphinn.com/published/month	1
111	Oct 24, 2007 @ 15:07	http://sphinn.com/story.php?id=10881	1
112	Oct 26, 2007 @ 09:53	http://sphinn.com/story.php?id=11205	1
113	Oct 30, 2007 @ 08:58	http://sphinn.com/upcoming/page/3	1
114	Oct 30, 2007 @ 12:31	http://sphinn.com/upcoming/most	1
Total			3,688
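
If you want to run the numbers on a referrer report like this yourself, here is a minimal sketch in Python. It assumes a tab-separated export shaped like the table above and uses a handful of its rows as sample data. The function name `tally_by_story` and the regular expression are made up for illustration; they are not part of any stats package. The only trick is that the same story shows up under two URL styles (`/story/1622` and `/story.php?id=1622`), so both get folded into one bucket.

```python
import re
from collections import Counter

# Sample rows taken from the table above: rank, last seen, referrer, count.
ROWS = """\
1\tOct 09, 2007 @ 23:29\thttp://sphinn.com/story/1622\t504
4\tOct 08, 2007 @ 04:21\thttp://sphinn.com/story/5469\t288
37\tSep 15, 2007 @ 08:26\thttp://sphinn.com/story.php?id=5469\t8
68\tSep 02, 2007 @ 14:29\thttp://sphinn.com/story.php?id=1622\t2
3\tNov 01, 2007 @ 03:42\thttp://sphinn.com\t293
"""

# Matches both URL styles: /story/1622 and /story.php?id=1622
STORY_ID = re.compile(r"/story(?:/|\.php\?id=)(\d+)")


def tally_by_story(text):
    """Sum referral counts per story ID; everything else lands in 'other'."""
    totals = Counter()
    for line in text.splitlines():
        rank, last_seen, referrer, count = line.split("\t")
        # Some stats reports pad URLs with spaces for line wrapping; strip them.
        url = referrer.replace(" ", "")
        match = STORY_ID.search(url)
        key = f"story {match.group(1)}" if match else "other"
        totals[key] += int(count.replace(",", ""))
    return totals


totals = tally_by_story(ROWS)
print(totals.most_common())
```

With these five rows, story 1622 ends up with 506 referrals, story 5469 with 296, and the home page lands in the "other" bucket with 293.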


21 comments Sebastian | Internet Marketing, Social Web, Spam, Crap

