When you run a website that deals with user generated content (UGC), this pamphlet is for you.
#bad_news Otherwise you might enjoy it.
Not that the recent –and ongoing– paradigm shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I’ll blather about pitfalls, leaving chances unmentioned.
For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.
More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling http://twitter.com/your-user-name or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.
Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.
Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That’s way cheaper, and faster, than polling a gazillion sources for updates 24/7/365.
Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.
Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).
Traditional discovery crawling, indexing, and ranking doesn’t exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion computers to this task. Also, submission based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML-sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.
Nothing is like it used to be. Change happens.
Why does this paradigm shift put your site at risk?
As a matter of fact, when you publish user generated content, you will get spammed. Of course, that’s bad news of yesterday. Probably you’re confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot it’s spam that slipped through your filters.
Bad news is, your medieval palisade won’t protect you from a 21st century tank attack with air support. Why not? Because you’ve secured your HTML presentation layer, but not your feeds. There’s no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.
Feed items come with a high risk. Once they’re released, they’re immortal and multiply themselves like rabbits. That’s bad enough in case a pissed employee ‘accidentally’ publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, even if only for a few milliseconds.
If your content management system (CMS) creates a feed item on submission, search engines –in good company with legions of Web services– will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.
Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed reader flooded with PPC spam), and search engines as well (you send them weird spam signals that raise all sorts of red flags). Also, it’s not desirable to make social media services –that you rely on for marketing purposes– too suspicious (trigger happy anti-spam algos might lock away your site’s base URI in an escape-proof dungeon).
So what can you do to prevent your feeds from unwanted content?
Before I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might not have considered yet:
- No nay never use integers as IDs before you’re dead sure that a piece of submitted content is floral white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren’t predictable and therefore not that vulnerable. Once a content submission is finally approved, you can give it a nice –maybe even meaningful– URI.
- No nay never use titles, subjects or so in URIs, not even converted text from submissions (e.g. ‘My PPC spam’ ==> ‘my_ppc_spam’). Why not? See above. And you don’t really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can’t fetch, or which they can’t refetch, at least for a while.
- Before the final approval, serve submitted content with a “noindex,nofollow,noarchive,nosnippet” X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section. Don’t rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide that they should crawl the link’s destination anyway.
- Consider serving social media bots requesting a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don’t obey REP directives, that means they fetch and process your stuff regardless of whether you yell “noindex” at them or not.
- For all burned (disapproved) URIs that were in use, ensure that your server returns a 410-Gone HTTP status code, or perform a 301 redirect to a policy page or so to rescue link love that would get wasted otherwise.
- Your Web forms for content submissions should be totally AJAX’ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk throughs, for example by asking for something unusual when a step gets completed in a too short period of time. With AJAX calls that’s painless for the legit user. Do not accept content submissions via standard GET or POST requests.
- Serve link builders coming from SERPs for [URL|story|link submit|submission your site’s topic] etc. your policy page, not the actual Web form.
- There’s more. With the above said, I’ve just begun to scratch the surface of a savvy spammer’s technical portfolio. There’s next to nothing a cleverly programmed bot can’t mimic. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.
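To make a few of the points above more tangible, here is a minimal Python sketch, not a drop-in solution. It generates unguessable provisional identifiers and picks an HTTP status code plus robots header depending on a submission’s state. The names `Status`, `provisional_id`, `respond` and the bot name in `SOCIAL_BOT_AGENTS` are made up for illustration, and the `Retry-After` header is my assumption, not something the list above prescribes.

```python
import uuid
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # submitted, not yet finally approved
    RELEASED = "released"  # finally approved
    BURNED = "burned"      # disapproved, the URI must not serve content again


# Hypothetical user agent substrings of social media bots,
# compiled from your own raw logs.
SOCIAL_BOT_AGENTS = ("examplesocialbot",)

ROBOTS_DIRECTIVES = "noindex,nofollow,noarchive,nosnippet"


def provisional_id() -> str:
    """Random, unguessable identifier for a not yet approved submission."""
    return uuid.uuid4().hex


def respond(status: Status, user_agent: str) -> tuple[int, dict[str, str]]:
    """Return (HTTP status code, extra response headers) for a submission URI."""
    if status is Status.BURNED:
        # Burned URIs are gone for good.
        return 410, {}
    if status is Status.PENDING:
        if any(bot in user_agent.lower() for bot in SOCIAL_BOT_AGENTS):
            # These bots ignore REP directives, so don't serve them content.
            # Retry-After is an assumption added for illustration.
            return 503, {"Retry-After": "3600"}
        # Everybody else gets the page, with indexing switched off.
        return 200, {"X-Robots-Tag": ROBOTS_DIRECTIVES}
    return 200, {}
```

A real handler would also emit the corresponding robots meta element in the HEAD section, and it could swap the 410 for a 301 to a policy page where link love is worth rescuing.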
Having said that, let’s proceed to feed protection tactics. Actually, there’s just one principle set in stone:
Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!
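In code, that principle boils down to filtering at the one place where feed items get created. The following Python sketch assumes a hypothetical `Submission` record with a `released` flag that gets set only on final approval. How your CMS stores this will differ.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Submission:
    title: str
    url: str
    released: bool  # True only after the final approval


def feed_items(submissions):
    """Only finally approved submissions are eligible for feeds and sitemaps."""
    return [s for s in submissions if s.released]


def render_rss_items(submissions) -> str:
    """Render RSS <item> elements for released submissions only."""
    return "\n".join(
        f"<item><title>{escape(s.title)}</title><link>{escape(s.url)}</link></item>"
        for s in feed_items(submissions)
    )
```

The point of funnelling everything through a single `feed_items` function is that RSS, ATOM and XML sitemap generators can all share it, so no output channel can leak a pending submission by accident.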
The interesting question is: what the heck is a “final approval”? Well, that depends on your needs. Ideally, that’s you releasing each and every piece of submitted content. Since this approach doesn’t scale, think of ways to semi-automate the process. Don’t fully automate it, there’s no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.
Each and every content submission must survive a probation period, during which it will not be included in your site’s feeds. Regardless of who contributed it. Stick with the four-eye principle. Here are a few generic procedures you could adapt, or ideas which could inspire you:
- Queue submissions. Possible queues are Blocked, Quarantine, Suspect, Probation, and finally Released. Define simple rules and procedures that anyone involved can follow. SOPs lack work arounds and loopholes by design.
- Stick content submissions from new users who didn’t participate in other ways in quarantine. Moderate this queue and only manually release into the probation queue what passes the moderator’s heuristics. Signup-submit-and-forget is a typical spammer behavior.
- Maintain black lists of domain names, IPs, countries, user agent names, unwanted buzzwords and so on. Use filters to arrest submissions that contain keywords you wouldn’t expect to match your site’s theme in the Blocked or Quarantine queue.
- On submission fetch the link’s content and analyze it, don’t stick with heuristic checks of URIs, titles and descriptions. Don’t use methods like PHP’s file_get_contents that don’t return HTTP response codes. You need to know whether a requested URI is the first one of a redirect chain, for example. Double check with a second request from another IP, preferably owned by a widely used ISP, with a standard browser’s user agent string, that provides an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission’s suggested anchor text. If the returned content differs too much, set a red flag.
- Maintain white lists, too. That’s a great way to reduce the amount of unavoidable false positives.
- If you have editorial staff or moderators, they should get a Release to Feed button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like “release to feed if mod-release = true and num-trusted-votes > 10”.
- Categorize your users’ reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.
- Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.
- With all automated releases, for example based on votes, oops, especially based on votes, implement at least one additional sanity check. For example discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votes, and so on.
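The release rule and the vote sanity checks from the list could be wired together roughly like this. It’s a hedged Python sketch: the vote fields (`account_age_days`, `contributions`, `timestamp`) and all thresholds except the “more than 10 trusted votes” example are assumptions you’d have to tune for your own site.

```python
from statistics import pstdev


def trusted_votes(votes, min_account_age_days=30, min_contributions=5):
    """Discard votes from new users and users with little participation history."""
    return [
        v for v in votes
        if v["account_age_days"] >= min_account_age_days
        and v["contributions"] >= min_contributions
    ]


def looks_scripted(timestamps, min_votes=5, max_jitter_seconds=2.0):
    """Flag vote sequences whose time gaps are suspiciously regular."""
    if len(timestamps) < max(min_votes, 2):
        return False  # too few votes to judge a pattern
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # Nearly identical gaps (low standard deviation) smell like a voting bot.
    return pstdev(gaps) <= max_jitter_seconds


def release_to_feed(mod_release, votes, flags, min_trusted_votes=10):
    """Release only if a mod approved, no flag was raised, and enough
    trusted votes without a bot-like pattern came in."""
    if flags or not mod_release:
        return False
    trusted = trusted_votes(votes)
    if looks_scripted([v["timestamp"] for v in trusted]):
        return False
    return len(trusted) > min_trusted_votes
```

Checking the standard deviation of the gaps between votes is just one cheap pattern test. A determined spammer will randomize delays, so treat it as one signal among several, not as the sanity check.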
Disclaimer: That’s just some food for thought. I want to make absolutely clear that I can’t provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.