How to borrow relevance from authority pages with 307 redirects

Every once in a while I switch to Dr Evil mode. That’s a “do more evil” type of pamphlet. Don’t bother reading the disclaimer, just spam away …

Why the heck should you invest valuable time in crafting compelling content when there's a shortcut?

There are so many awesome Web pages out there, just pick some and steal their content. You say “duplicate content issues”, I say “don’t worry”. You say “copyright violation”, I say “be happy”. Below I explain the setup.

This somewhat shady IM technique is for you when you're shy of automated content generation.

Register a new (short!) domain and create a tiny site with a few pages of totally unique and somewhat interesting content. Write opinion pieces, academic papers or whatnot, just don’t use content generators or anything that cannot pass a human bullshit detector. No advertising. No questionable links. Instead, link out to authority pages. No SEO stuff like nofollow’ed links to imprints or so.

Launch with a few links from clean pages. Every now and then drop a deep link in relevant discussions on forums or social media sites. Let the search engines become familiar with your site. That'll even attract a few natural inbound links, at least if your content is linkworthy.

Use Google's Webmaster Console (GWC) to monitor your progress. Once all URIs from your sitemap are indexed and show up in [site:yourwebspam.com] searches, begin to expand your site's menu and to change the outgoing links to authority pages that are embedded in your content.

Create short URIs (20 characters or fewer!) that point to authority pages. Serve search engine crawlers a 307 redirect, and human surfers a 301. Build deep links to those URIs, for example in tweets. Once you've gathered 1,000+ inbounds, you'll receive SERP traffic. By the way, don't buy the sandbox myths.
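For illustration only, here's a minimal sketch of that (now defunct, see the disclaimer below) redirect setup in PHP. The redirect map, the short path, and the crawler pattern are made-up examples, not a recommendation.

  <?php
  // Minimal sketch of the (now defunct) setup described above. The redirect
  // map, the short path, and the crawler pattern are made-up examples.
  $redirects = array(
      '/r1' => 'http://www.example-authority.com/great-article',
  );
  $path   = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
  $target = isset($redirects[$path]) ? $redirects[$path] : 'http://example.com/';
  $ua     = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

  if (preg_match('/Googlebot|Slurp|msnbot/i', $ua)) {
      header('Location: ' . $target, true, 307);   // crawlers get the temporary redirect
  } else {
      header('Location: ' . $target, true, 301);   // human surfers get the permanent one
  }
  exit;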

Watch the keywords page in your GWC account. It gets populated with keywords that appear only in the content of pages you've hijacked with redirects. Watch your [site:yourwebspam.com] SERPs. Usually the top 10 keywords listed in the GWC report will originate from pages listed on the first [site:yourwebspam.com] SERPs, provided you've hijacked awesome content.

Add (new) keywords from pages that appear both among the redirect destinations listed within the first 20 [site:yourwebspam.com] search results and among the first 20 listed keywords to the articles you actually serve on your domain.

Detect SERP referrers (human surfers who've clicked your URIs on search result pages) and redirect those to sales pitches. That goes for content pages as well as for redirecting URIs (mimicking shortened URIs). Laugh all the way to the bank.

Search engines will rarely discover your scam. Of course shit happens, though. Once the domain is burned, just block crawlers, redirect everything else to your sponsors, and let the domain expire.

Disclaimer: Google has put an end to most 307 spam tactics. That's why I'm publishing all this crap. Because watching decreasing traffic to spammy sites is frustrating. Deceptive 307'ing URIs won't rank any more. Slowly, actually very slowly, GWC reports follow suit.

What can we learn? Do not believe in the truth of search engine reports. Just because Google’s webmaster console tells you that Google thinks a keyword is highly relevant to your site, that doesn’t mean you’ll rank for it on their SERPs. Most probably GWC is not the average search engine spammer’s tool of the trade.




Handle your (UGC) feeds with care!

When you run a website that deals with user generated content (UGC), this pamphlet is for you. #bad_news Otherwise you might enjoy it. #malicious_joy


Not that the recent –and ongoing– paradigm shift in crawling, indexing, and ranking is bad news in general. On the contrary, for the tech savvy webmaster it comes with awesome opportunities with regard to traffic generation and search engine optimization. In this pamphlet I'll blather about pitfalls, leaving chances unmentioned.

Background

For a moment forget everything you’ve heard about traditional crawling, indexing and ranking. Don’t buy that search engines solely rely on ancient technologies like fetching and parsing Web pages to scrape links and on-the-page signals out of your HTML. Just because someone’s link to your stuff is condomized, that doesn’t mean that Google & Co don’t grab its destination instantly.

More and more, search engines crawl, index, and rank stuff hours, days, or even weeks before they actually bother to fetch the first HTML page that carries a (nofollow’ed) link pointing to it. For example, Googlebot might follow a link you’ve tweeted right when the tweet appears on your timeline, without crawling http://twitter.com/your-user-name or the timeline page of any of your followers. Magic? Nope. The same goes for favs/retweets, stumbles, delicious bookmarks etc., by the way.

Guess why Google encourages you to make your ATOM and RSS feeds crawlable. Guess why FeedBurner publishes each and every update of your blog via PubSubHubbub so that Googlebot gets an alert of new and updated posts (and comments) as you release them. Guess why Googlebot is subscribed to FriendFeed, crawling everything (blog feed items, Tweets, social media submissions, comments …) that hits a FriendFeed user account in real-time. Guess why GoogleReader passes all your Likes and Shares to Googlebot. Guess why Bing and Google are somewhat connected to Twitter’s database, getting all updates, retweets, fav-clicks etc. within a few milliseconds.

Because all these data streams transport structured data that are easy to process. Because these data get pushed to the search engine. That's way cheaper, and faster, than polling a gazillion sources for updates 24/7/365.

Making use of structured data (XML, RSS, ATOM, PUSHed updates …) enables search engines to index fresh content, as well as reputation and other off-page signals, on the fly. Many ranking signals can be gathered from these data streams and their context, others are already on file. Even if a feed item consists of just a few words and a link (e.g. a tweet or stumble-thumbs-up), processing the relevant on-the-page stuff from the link’s final destination by parsing cluttered HTML to extract content and recommendations (links) doesn’t really slow down the process.

Later on, when the formerly fresh stuff starts to decompose on the SERPs, signals extracted from HTML sources kick in, for example link condoms, link placement, context and so on. Starting with the discovery of a piece of content, search engines permanently refine their scoring, until the content finally drops out of scope (spam filtering, unpaid hosting bills and other events that make Web content disappear).

Traditional discovery crawling, indexing, and ranking doesn't exactly work for real-time purposes, nor in near real-time. Not even when a search engine assigns a bazillion computers to this task. Also, submission-based crawling is not exactly a Swiss Army knife when it comes to timely content. Although XML sitemaps were a terrific accelerator, they must be pulled for processing, hence a delay occurs by design.

Nothing is like it used to be. Change happens.

Why does this paradigm shift put your site at risk?

As a matter of fact, when you publish user generated content, you will get spammed. Of course, that's yesterday's bad news. Probably you're confident that your anti-spam defense lines will protect you. You apply link condoms to UGC link drops and all that. You remove UGC once you spot spam that slipped through your filters.

Bad news is, your medieval palisade won't protect you from a 21st century tank attack with air support. Why not? Because you've secured your HTML presentation layer, but not your feeds. There's no such thing as a rel-nofollow microformat for URIs in feeds, and even condomized links transported as CDATA (in content elements) are surrounded by spammy textual content.

Feed items come with a high risk. Once they're released, they're immortal and multiply themselves like rabbits. That's bad enough in case a pissed employee 'accidentally' publishes financial statements on your company blog. It becomes worse when seasoned spammers figure out that their submissions can make it into your feeds, even if only for a few milliseconds.

If your content management system (CMS) creates a feed item on submission, search engines –in good company with legions of Web services– will distribute it all over the InterWeb, before you can hit the delete button. It will be cached, duplicated, published and reprinted … it’s out of your control. You can’t wipe out all of its instances. Never.

Congrats. You found a surefire way to piss off both your audience (your human feed subscribers getting their feed readers flooded with PPC spam) and the search engines (you send them weird spam signals that raise all sorts of red flags). Also, it's not desirable to make social media services –that you rely on for marketing purposes– too suspicious (trigger-happy anti-spam algos might lock away your site's base URI in an escape-proof dungeon).

So what can you do to keep unwanted content out of your feeds?

Before I discuss advanced feed protection, let me point you to a few popular vulnerabilities you might not have considered yet:

  • No nay never use integers as IDs before you're dead sure that a piece of submitted content is white as snow. Integer sequences produce guessable URIs. Instead, generate a UUID (aka GUID) as identifier. Yeah, I know that UUIDs make ugly URIs, but those aren't predictable and therefore not that vulnerable. Once a content submission is finally approved, you can give it a nice –maybe even meaningful– URI.
  • No nay never use titles, subjects and the like in URIs, not even converted text from submissions (e.g. 'My PPC spam' ==> 'my_ppc_spam'). Why not? See above. Also, you don't really want to create URIs that contain spammy keywords, or keywords that are totally unrelated to your site. Remember that search engines do index even URIs they can't fetch, or can't refetch, at least for a while.
  • Before the final approval, serve submitted content with a "noindex,nofollow,noarchive,nosnippet" X-Robots-Tag in the HTTP header, and put a corresponding meta element in the HEAD section (see the sketch after this list). Don't rely on link condoms. Sometimes search engines ignore rel-nofollow as an indexer directive on link level, and/or decide to crawl the link's destination anyway.
  • Consider serving social media bots that request a not yet approved piece of user generated content a 503 HTTP response code. You can compile a list of their IPs and user agent names from your raw logs. These bots don't obey REP directives, that is, they fetch and process your stuff regardless of whether you yell "noindex" at them or not.
  • For all burned (disapproved) URIs that were in use, ensure that your server returns a 410-Gone HTTP status code, or perform a 301 redirect to a policy page or so to rescue link love that would otherwise get wasted.
  • Your Web forms for content submissions should be totally AJAX'ed. Use CAPTCHAs and all that. Split the submission process into multiple parts, each of them talking to the server. Reject excessively rapid walk-throughs, for example by asking for something unusual when a step gets completed in too short a period of time. With AJAX calls that's painless for the legit user. Do not accept content submissions via standard GET or POST requests.
  • Serve link builders coming from SERPs for queries like [URL|story|link submit|submission + your site's topic] your policy page, not the actual Web form.
  • There's more. With the above said, I've just begun to scratch the surface of a savvy spammer's technical portfolio. There's next to nothing a cleverly programmed bot can't mimic. Be creative and think outside the box. Otherwise the spammers will be ahead of you in no time, especially when you make use of a standard CMS.
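To make a few of the points above more tangible, here's a rough sketch of the pre-approval serving logic in PHP. The lookup and rendering helpers, the status values, and the bot list are assumptions, not a drop-in implementation.

  <?php
  // Hypothetical helpers: load_submission() and render_submission() stand in
  // for whatever your CMS provides. Submission IDs are UUIDs, not integers.
  $submission = load_submission(isset($_GET['id']) ? $_GET['id'] : '');
  $metaRobots = '';

  if ($submission === null || $submission['status'] === 'burned') {
      // Disapproved URI that was in use: gone for good ...
      header('HTTP/1.1 410 Gone');
      // ... or rescue the link love instead:
      // header('Location: http://example.com/ugc-policy', true, 301);
      exit;
  }

  if ($submission['status'] !== 'released') {
      // Social media bots ignore REP directives, so turn them away entirely.
      $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
      if (preg_match('/FriendFeedBot|Twitterbot|bitlybot/i', $ua)) {   // assumed list
          header('HTTP/1.1 503 Service Unavailable');
          header('Retry-After: 86400');
          exit;
      }
      // Keep search engines from indexing content that's still on probation.
      header('X-Robots-Tag: noindex,nofollow,noarchive,nosnippet');
      $metaRobots = '<meta name="robots" content="noindex,nofollow,noarchive,nosnippet">';
  }

  render_submission($submission, $metaRobots);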

Having said that, let's proceed to feed protection tactics. Actually, there's just one principle set in stone:

Make absolutely sure that submitted content can’t make it into your feeds (and XML sitemaps) before it’s finally approved!

The interesting question is: what the heck is a "final approval"? Well, that depends on your needs. Ideally, that's you releasing each and every piece of submitted content. Since this approach doesn't scale, think of ways to semi-automate the process. Don't fully automate it; there's no such thing as an infallible algo. Also, consider the wisdom of the crowd spammable (voting bots). Some spam will slip through, guaranteed.

Each and every content submission must survive a probation period, during which it will not be included in your site's feeds, regardless of who contributed it. Stick with the four-eye principle. Here are a few generic procedures you could adapt, or ideas that could inspire you:

  • Queue submissions. Possible queues are Blocked, Quarantine, Suspect, Probation, and finally Released. Define simple rules and procedures that anyone involved can follow. SOPs lack workarounds and loopholes by design.
  • Stick content submissions from new users who haven't participated in other ways into the quarantine queue. Moderate this queue and manually release into the probation queue only what passes the moderator's heuristics. Signup-submit-and-forget is typical spammer behavior.
  • Maintain blacklists of domain names, IPs, countries, user agent names, unwanted buzzwords and so on. Use filters to hold submissions that contain keywords you wouldn't expect to match your site's theme in the Blocked or Quarantine queue.
  • On submission, fetch the link's content and analyze it; don't stick with heuristic checks of URIs, titles and descriptions. Don't use methods like PHP's file_get_contents that don't return HTTP response codes. You need to know whether a requested URI is the first one of a redirect chain, for example. Double-check with a second request from another IP, preferably one owned by a widely used ISP, with a standard browser's user agent string and an HTTP_REFERER, for example a Google SERP with a q parameter populated with a search term compiled from the submission's suggested anchor text. If the returned content differs too much, set a red flag (see the sketch after this list).
  • Maintain whitelists, too. That's a great way to reduce the number of unavoidable false positives.
  • If you have editorial staff or moderators, they should get a Release to Feed button. You can combine mod releases with a minimum number of user votes or so. For example you could define a rule like "release to feed if mod-release = true and num-trusted-votes > 10".
  • Categorize your users' reputation and trustworthiness. A particular number of votes from trusted users could approve a submission for feed inclusion.
  • Don’t automatically release submissions that have raised any flag. If that slows down the process, refine your flagging but don’t lower the limits.
  • With all automated releases, for example based on votes, oops, especially based on votes, implement at least one additional sanity check. For example, discard votes from new users as well as from users with a low participation history, check the sequence of votes for patterns like similar periods of time between votes, and so on.
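Here's the sketch promised in the list above: fetch a submitted link twice, once as a plain bot and once disguised as a browser coming from a SERP, then compare. The user agent strings, the similarity threshold, and the example URI are arbitrary assumptions; ideally the second request would also come from a different IP.

  <?php
  // Fetch a URI, follow the whole redirect chain, and report what came back.
  function fetch_url($url, $userAgent, $referer = '') {
      $ch = curl_init($url);
      curl_setopt_array($ch, array(
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_FOLLOWLOCATION => true,
          CURLOPT_MAXREDIRS      => 10,
          CURLOPT_USERAGENT      => $userAgent,
          CURLOPT_REFERER        => $referer,
          CURLOPT_TIMEOUT        => 15,
      ));
      $body = curl_exec($ch);
      $info = array(
          'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
          'redirects' => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT),
          'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
          'body'      => (string) $body,
      );
      curl_close($ch);
      return $info;
  }

  $submittedUri = 'http://example.com/submitted-link';   // from the submission
  $anchorText   = 'cheap blue pills';                     // suggested anchor text

  // Request 1: plain fetch, no referrer.
  $plain = fetch_url($submittedUri, 'UgcLinkChecker/1.0 (+http://example.org/bot)');

  // Request 2: a browser-ish user agent arriving from a Google SERP.
  $serpReferer = 'http://www.google.com/search?q=' . urlencode($anchorText);
  $browser = fetch_url($submittedUri, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)', $serpReferer);

  // Red flags: redirect chains, and content that differs too much between fetches.
  similar_text(strip_tags($plain['body']), strip_tags($browser['body']), $percent);
  if ($plain['redirects'] > 0 || $plain['status'] !== 200 || $percent < 70.0) {
      echo "Flag this submission for manual review\n";
  }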

Disclaimer: That's just some food for thought. I want to make absolutely clear that I can't provide bullet-proof anti-spam procedures. Feel free to discuss your thoughts, concerns, questions … in the comments.




The most sexy browsers screw your analytics

Now that IE is quite unusable due to the lack of websites that support its non-standard rendering, and the current Firefox version suffers from various maladies, more and more users switch to browsers that are supposed to comply with Web standards, such as Chrome, Safari, or Opera.

Those sexy user agents execute client-side scripts at lightning speed, making surfers addicted to nifty rounded corners very, very happy. Of course they come with massive memory leaks, but surfers who shut down their browser every once in a while won't notice such geeky details.

Why is that bad news for Internet marketers? Because Chrome and Safari screw your analytics. Your stats are useless with regard to bookmarkers and type-in traffic. Your referrer stats lack all hits from Chrome/Safari users who have opened your landing page in a new tab or window.

Google's Chrome and Apple's Safari do not provide an HTTP_REFERER when a link is opened in a new tab or window. (The typo is standardized, too.)

This bug was reported in September 2008. It’s not yet fixed. Not even in beta versions.

Guess from which (optional) HTTP header line your preferred stats tool compiles the search terms to create all the cool keyword statistics? Yup, that’s the HTTP_REFERER’s query string when the visitor came from a search result page (SERP). Especially on SERPs many users open links in new tabs. That means with every searcher switching to a sexy browser your keyword analysis becomes more useless.
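For illustration, a tiny PHP sketch of what referrer-based keyword stats boil down to. The engine list and parameter names reflect the majors of that era; treat it as an assumption-laden example, not a full parser.

  <?php
  // Pull the search terms out of the HTTP_REFERER's query string. When the
  // browser sends no referrer (new tab/window in Chrome or Safari back then),
  // there's nothing to extract and the visit is invisible to keyword stats.
  $referer     = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
  $searchTerms = null;

  if ($referer !== '') {
      $parts = parse_url($referer);
      $host  = isset($parts['host']) ? $parts['host'] : '';
      parse_str(isset($parts['query']) ? $parts['query'] : '', $params);

      // Google and Bing expose the query as q=, Yahoo used p=.
      if (preg_match('/(^|\.)(google|bing|yahoo)\./i', $host)) {
          if (!empty($params['q'])) {
              $searchTerms = $params['q'];
          } elseif (!empty($params['p'])) {
              $searchTerms = $params['p'];
          }
      }
  }

  // $searchTerms stays null for type-ins, bookmarks, and referrer-less requests.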

That's not only an analytics issue. Many sites provide sensible functionality based on the referrer (the Web page a user came from), for example default search terms for site-search facilities gathered from SERP referrers. Many sites evaluate the HTTP_REFERER to protect themselves from hotlinking, so their users can't view the content they've paid for when they open a link in a new tab or window.

Passing a blank HTTP_REFERER when this information is available to the user agent is plain evil. Of course lots of so-called Internet security apps do this by default, but just because others do evil that doesn’t mean a top-notch Web browser like Safari or Chrome can get away with crap like this for months and years to come.

Please nudge the developers!

Here you go. Post in this thread why you want them to fix this bug asap. Tell the developers that you can’t live with screwed analytics, and that your site’s users rely on reliable HTTP_REFERERs. Even if you don’t run a website yourself, tell them that your favorite porn site bothers you with countless error messages instead of delivering smut, just because WebKit browsers are buggy.


You can test whether your browser passes the HTTP_REFERER or not: Go to this Google SERP. On the link to this post choose "Open link in new tab" (or window) in the context menu (right-click over the link). Scroll down.

Your browser passed this HTTP_REFERER: None




As if sloppy social media users ain’t bad enough … search engines support traffic theft

Prepare for a dose of techy tin foil hattery. Again, I'm going to rant about a nightmare that Twitter & Co created with their crappy, thoughtless and shortsighted software designs: URI shorteners (yup, it's URI, not URL).

Recap: Each and every 3rd party URI shortener is evil by design. Those questionable services do or will steal your traffic and your Google juice, mislead and piss off your potential visitors –er, customers– and hurt you in countless other ways. If you consider yourself south of sanity, do not make use of shortened URIs you don't own.

Actually, this pamphlet is not about sloppy social media users who shoot themselves in both feet, and it’s not about unscrupulous micro blogging platforms that force their users to hand over their assets to felonious traffic thieves. It’s about search engines that, in my humble opinion, handle the sURL dilemma totally wrong.

Some of my claims are based on experiments that I'm not willing to reveal (yet). For example I won't explain sneaky URI hijacking, or how I stole a portion of tinyurl.com's search engine traffic with a shortened URI, passing searchers to a charity site, although it seems the search engine I've gamed has closed this particular loophole now. There are still way too many playgrounds for deceptive tactics involving shortened URIs.

How should a search engine handle a shortened URI?

Handling an URI as a shortened URI requires a bulletproof method to detect shortened URIs. That's a breeze.

  • Redirect patterns: URI shorteners receive lots of external inbound links that get redirected to 3rd party sites. Linking pages, stopovers and destination pages usually reside on different domains. The method of redirection can vary. Most URI shorteners perform 301 redirects, some use 302 or 307 HTTP response codes, some frame the destination page displaying ads in the top frame, and I've even seen a few of them making use of meta refreshes and client-side redirects. Search engines can detect all those procedures.
  • Link appearance: redirecting URIs that belong to URI shorteners often appear on pages and in feeds hosted by social media services (Twitter, Facebook & Co).
  • Seed: trusted sources like LongURL.org provide lists of domains owned by URI shortening services. Social media outlets providing their own URI shorteners don’t hide server name patterns (like su.pr …).
  • Self exposure: the root index pages of URI shorteners, as well as other pages on those domains that serve a 200 response code, usually mention explicit terms like “shorten your URL” et cetera.
  • URI length: the length of an URI string, if 20 characters or less, is an indicator at most, because some URI shortening services offer keyword-rich short URIs, and many sites provide natural URIs this short.

Search engine crawlers bumping into short URIs should do a lookup, following the complete chain of redirects. (Some whacky services shorten everything that looks like an URI, even shortened URIs, or do a lookup themselves, replacing the original short URI with another short URI that they can track. Yup, that's some crazy insanity.)
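Here's roughly what such a lookup could look like, sketched in PHP: walk the redirect chain hop by hop and record every stopover plus the final destination. The hop limit and the example short URI are arbitrary assumptions, and CURLINFO_REDIRECT_URL needs a reasonably recent PHP/cURL.

  <?php
  // Resolve a (possibly nested) short URI into its chain of stopovers.
  function resolve_redirect_chain($uri, $maxHops = 10) {
      $chain = array();
      for ($hop = 0; $hop < $maxHops; $hop++) {
          $ch = curl_init($uri);
          curl_setopt_array($ch, array(
              CURLOPT_NOBODY         => true,    // HEAD-style request per hop
              CURLOPT_RETURNTRANSFER => true,
              CURLOPT_HEADER         => true,
              CURLOPT_FOLLOWLOCATION => false,   // inspect one hop at a time
              CURLOPT_TIMEOUT        => 10,
          ));
          curl_exec($ch);
          $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
          $location = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
          curl_close($ch);

          $chain[] = array('uri' => $uri, 'status' => $status);
          if ($status < 300 || $status >= 400 || !$location) {
              break;   // reached the destination (or something that doesn't redirect)
          }
          $uri = $location;   // next stopover
      }
      return $chain;   // every stopover is an alias of the last entry
  }

  print_r(resolve_redirect_chain('http://tinyurl.com/example'));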

Each and every stopover (shortened URI) should get indexed as an alias of the destination page, but must not appear on SERPs unless the search query contains the short URI or the destination URI (that means not on [site:tinyurl.com] SERPs, but on a [site:tinyurl.com shortURI] or a [destinationURI] search result page). 3rd party stopovers mustn't gain reputation (PageRank™, anchor text, or whatever), regardless of the method of redirection. All the link juice belongs to the destination page.

In other words: search engines should make use of their knowledge of shortened URIs in response to navigational search queries. In fact, search engines could even solve the problem of vanished and abused short URIs.

Now let’s see how major search engines handle shortened URIs, and how they could improve their SERPs.

Bing doesn’t get redirects at all

Oh what a mess. The candidate from Redmond fails totally at understanding the HTTP protocol. Their search index is flooded with a bazillion URI-only listings that all do a 301 redirect, more than 200,000 from tinyurl.com alone. You'll also find URIs in their index that do a permanent redirect and have nothing to do with URI shortening.

I can't be bothered with checking what Bing does in response to other redirects, since it fails the 301 test so badly. Clicking on their first results for [site:tinyurl.com], I've noticed that many lead to mailto://working-email-addy type of destinations. Dear Bing, please remove those search results as soon as possible, before anyone figures out how to use your SERPs/APIs to launch massive email spam campaigns. As for tips on how to improve your short-URI SERPs, see the Yahoo and Google sections below.

Yahoo does an awesome job, with a tiny exception

Yahoo has done a better job. They index short URIs and show the destination page, at least via their Site Explorer. When I search for a tinyURL, the SERP link points to the URI shortener; that could be improved by linking to the destination page.

By the way, Yahoo is the only search engine that handles abusive short URIs totally right (I will not elaborate on this issue, so please don't ask for detailed information if you're not an SE engineer). Yahoo bravely passed the 301 test, as well as others (including pretty evil tactics). I so hope that MSN will adopt Yahoo's bright logic before Bing overtakes Yahoo search. By the way, that can be accomplished without sending out spammy bots (hint2bing).

Google does it by the book, but there's room for improvement

As for tinyURLs, Google indexes only pages on the tinyurl.com domain, including previews. Unfortunately, the snippets don't provide a link to the destination page. Although that's the expected behavior (those URIs aren't linked on the crawled page), it's sad. At least Google didn't fail the 301 test.

As for the somewhat evil tactics I've applied in my tests so far, Google fell in love with some abusive short URIs. Google –under particular circumstances– indexes shortened URIs that game Googlebot, and has sent SERP traffic to sneakily shortened URIs (that face the searcher with huge ads) instead of the destination page. Since I've begun to deploy sneaky sURLs, Google has greatly improved its spam filters, but they're not yet perfect.

Since Google is responsible for most of this planet’s SERP traffic, I’ve put better sURL handling at the very top of my xmas wish list.

About abusive short URIs

Shortened URIs do poison the Internet. They vanish, alter their destination, mislead surfers … in other words they are abusive by definition. There’s no such thing as a persistent short URI!

A long time ago Tim Berners-Lee told you that URI shorteners are evil –er, that fucking with URIs is a very bad habit. Did you listen? Do you make use of shortened URIs? If you post URIs that get shortened at Twitter, or if you make use of 3rd party URI shorteners elsewhere, consider yourself trapped in a low-life traffic theft scam. Shame on you, and shame on Twitter & Co.

Besides my somewhat shady experiments that hijacked URIs, stole SERP positions, and converted "borrowed" SERP traffic, there are so many other ways to abuse shortened URIs. Many of them are outright evil. Many of them do hurt your kids, and mine. Basically, that's not any search engine's problem, but search engines could help us get rid of the root of all sURL evil by handling shortened URIs with common sense, even when the last short URI has vanished.

Fight shortened URIs!

It's up to you. Go stop it. As long as you can't avoid URI shortening, roll your own URI shortener and make sure it can't get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive these utterly useless scumbags of their livelihood.

Unfortunately, as a father and as a webmaster, I don't believe in common sense applied by social media services. Hence, I see a "Twitter actively bypasses safe-search filters, tricking my children into viewing hardcore porn" post coming. Dear Twitter & Co. — and that addresses all services that make use of or transport shortened URIs — put an end to shortened URIs. Now!




Derek Powazek outed himself as big-mouthed and ignorant, and why that's a pity

With childish attacks on his colleagues, Derek Powazek didn't do professional Web development –as an industry– a favor. As a matter of fact, Derek Powazek insulted savvy Web developers, Web designers, even search engine staff, as well as usability experts and search engine specialists, who team up in countless projects helping small and large Web sites succeed.

I seriously can't understand how Derek Powazek "has survived 13 years in the web biz" (source) without detailed knowledge of how things get done in Web projects. I mean, if a developer really has worked 13 years in the Web biz, he should know that the task of optimizing a Web site's findability, crawlability, and accessibility for all user agents out there (SEO) is usually not performed by "spammers evildoers and opportunists", but by highly professional experts who just master Web development better than the average designer, copy-writer, publisher, developer or marketing guy.

Boy, what an ego. Derek Powazek truly believes that if “[all SEO/SEM techniques are] not obvious to you, and you make websites, you need to get informed” (source). That translates to “if you aren’t 100% perfect in all aspects of Web development and Internet marketing, don’t bother making Web sites — go get a life”.

Well, I consider very few folks capable of mastering everything in Web development and Internet marketing. Clearly, Derek Powazek is not a member of this elite. With one clueless, uninformed and way too offensive rant he has ruined his reputation in a single day. Shortly after his first thoughtless blog post libelling fellow developers and consultants, Google's search result page for [Derek Powazek] is flooded with reasonable reactions revealing that Derek Powazek's pathetic calls for ego food are factually wrong.

Of course calm and knowledgeable experts in the field setting the record straight, like Danny Sullivan (search results #1 and #4 for [Derek Powazek] today) and Peter da Vanzo (SERP position #9), can outrank a widely unknown guy like Derek Powazek at all major search engines. Now, for the rest of his presence on this planet, Derek Powazek has to live with search results that tell the world what kind of an "expert" he really is (example).

He should have read Susan Moskwa’s very informative article about reputation management on Google’s Blog a day earlier. Not that reputation management doesn’t count as an SEO skill … actually, that’s SEO basics (as well as URI canonicalization).

Dear Derek Powazek, guess what all the bright folks you’ve bashed so cleverly will do when you ask them to take down their responses to your uncalled-for dirty talk?

So what can we learn from this gratuitous debacle? Do not piss in someone’s roses when

  • you suffer from an oversized ego,
  • you’ve not the slightest clue what you’re talking about,
  • you can't make a point with proven facts, so you have to use false pretences and clueless assumptions,
  • you tend to insult people when you’re out of valid arguments,
  • willy whacking is not for you, because your dick is, well, somewhat undersized.

Ok, it’s Friday evening, so I’m supposed to enjoy TGIF’s. Why the fuck am I wasting my valuable spare time writing this pamphlet? Here’s why:

Having worked in, led, and coached WebDev teams on crawlability and best practices with regard to search engine crawling and indexing for ages now, I was faced with brain-amputated wannabe geniuses more than once. Such assclowns are able to shipwreck great projects. From my experience, the one and only way to keep teams sane and productive is sacking troublemakers the moment you realize they're unconvinceable. This Powazek dude has perfectly proven that his ignorance is persistent, and that his anti-social attitude is irreversible. He's the prime example of a guy I'd never hire (unless I worked for my worst enemy). Go figure.


Update 2009-10-19: I consider this a lame excuse. Actually, it’s even more pathetic than the malicious slamming of many good folks in his previous posts. If Derek Powazek really didn’t know what “SEO” means in the first place, his brain farts attacking something he didn’t understand at the time of publishing his rants are indefensible, provided he was anything south of sane then. Danny Sullivan doesn’t agree, and he’s right when he says that every industry has some black sheep, but as much as I dislike comment spammers, I dislike bullshit and baseness.




Full disclosure @ FTC

Trying to avoid an $11,000 fine in the Federal Trade Commission's war on bloggers:

When I write about –er, praise– search engines, that's totally paid-for, because I've received free search results upfront.




Search engines should make shortened URIs somewhat persistent

URI shorteners are crap. Each and every shortened URI expresses a design flaw. All –or at least most– public URI shorteners will shut down sooner or later, because shortened URIs are hard to monetize. Making use of 3rd party URI shorteners translates to “put traffic at risk”. Not to speak of link love (PageRank, Google juice, link popularity) lost forever.

Search engines could provide a way out of the sURL dilemma that Twitter & Co created with their crappy, thoughtless and shortsighted software designs. Here's how:

Most browsers support search queries in the address bar, as well as suggestions (aka search results) on DNS errors, and sometimes even on 404s or other HTTP response codes other than 200/3xx. That means browsers "ask a search engine" when an HTTP request fails.

When a TLD goes out of service –the .yu domain for example– search engines may already have crawled a 301 or a meta refresh from a page formerly living there. They know the new address and can lead the user to that (working) URI.

The same goes for shortened URIs created ages ago by URI shortening services that died in the meantime. Search engines have transferred all the link juice from the shortened URI to the destination page already, so why not point users that request a dead short URI to the right destination?

Search engines have all the data required for rescuing out-of-service short URIs in their databases. Not de-indexing "outdated" URIs belonging to URI shorteners would be a minor tweak. At least Google has stored attributes and behavior of all links on the Web since the past century, and most probably other search engines are operated by data rats too.

URI shorteners can be identified by simple patterns. They gather tons of inbound links from foreign domains that get redirected (not always with a 301!) to URIs on other 3rd party domains. Of course that applies to some ad servers too, but rest assured search engines do know the difference.

So why the heck haven't Google, Yahoo, MSN –er, Bing– and Ask offered such a service yet? I thought it's all about users, but I might have misread something. Sigh.

By the way, I've recorded search engine misbehavior with regard to shortened URIs that could arouse Jack The Ripper, but that's a completely different story.




The “just create compelling and useful content” lie

I'm so sick of the universal answer to all SEO questions. Each and every search engine rep keeps telling me that "creating a great and useful site with compelling content will gain me all the rankings I deserve". What a pile of bullshit. Nothing ranks without strong links. I deserve more, and certainly with less work.

Honestly, why the heck should I invest any amount of time and even money to make others happy? It’s totally absurd to put up a great site with compelling content that’s easy to navigate and all that, just to please an ungrateful crowd of anonymous users! What a crappy concept. I don’t buy it.

I create websites for the sole purpose of making money, and lots of green. It’s all about ME! Here I said it. Now pull out the plastic. ;-)

And there's another statement that really annoys me: "make sites for users, not for search engines". Again, with self-serving commandments like this one, search engine quality guidelines insult my intelligence. Why should I make even a single Web page for search engines? Google, Yahoo, Microsoft and Ask staff might all be wealthy folks, but from my experience they don't purchase much porn on-line.

I create and publish Web content to laugh all the way to the bank at the end of the day. For no other reason. I swear. To finally end the ridiculous discussion of utterly useless and totally misleading search engine guidelines, I'll guide you step by step through a few elements of a successful website, explaining why and for whom I do whatever I do, why it's totally selfish, and what it's worth.

In some cases that'll be quite a bit geeky, so just skip the technical stuff if you're a plain Internet marketer. Also, I don't do everything by the book on Web development, so please read the invisible fine print carefully.

robots.txt

This gatekeeper protects my sites from useless bot traffic. That goes for well-behaved bots, at least. Others might meet a script that handles them with care. I'd rather serve human users bigger ads than waste bandwidth on useless bots requesting shitloads of content.
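A hypothetical example of such a gatekeeper robots.txt; the paths and the bot name are placeholders, not a recommendation for any particular site.

  # Well-behaved bots: stay out of script folders and endless search/print URIs.
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /search/
  Disallow: /print/

  # A known bandwidth leech that never sends visitors: locked out completely.
  User-agent: SomeUselessScraperBot
  Disallow: /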

.htaccess

I'm a big fan of the "one URI = one piece of content" principle. I consider gazillions of URI variations serving similar content avoidable crap. That's because I can't remember complex URIs stuffed with tracking parameters and other superfluous clutter. Like the average bookmarking surfer, I prefer short and meaningful URIs. With a few simple .htaccess directives I make sure that everybody gets the requested piece of content –er, advertising– under the canonical URI.
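A minimal sketch of such directives, assuming Apache's mod_rewrite and a made-up hostname; a real ruleset would of course preserve legitimate query parameters.

  # One URI = one piece of content: force the canonical hostname ...
  RewriteEngine On
  RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
  RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

  # ... and strip tracking clutter (the trailing "?" drops the query string).
  RewriteCond %{QUERY_STRING} (^|&)utm_source= [NC]
  RewriteRule ^(.*)$ /$1? [R=301,L]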

ErrorDocuments

Error handling is important. Before I throw a 404 Not Found error at a human visitor, I analyze the request's context (e.g. REQUEST_URI, HTTP_REFERER). If possible, I redirect the user to the page s/he should have requested, or at least to a page with related links –er, banner ads. Bouncing punters don't make me any money.
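A rough sketch of that kind of error handling in PHP, wired up via an "ErrorDocument 404 /404.php" directive. The lookup helper and the fallback template are hypothetical.

  <?php
  // Friendlier 404 handling: guess what the visitor meant before giving up.
  $requestUri = $_SERVER['REQUEST_URI'];
  $referer    = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

  // E.g. match the requested slug against known URIs (hypothetical helper).
  $slug      = basename(parse_url($requestUri, PHP_URL_PATH));
  $bestMatch = find_similar_uri($slug, $referer);

  if ($bestMatch !== null) {
      header('Location: ' . $bestMatch, true, 301);   // send them where they meant to go
      exit;
  }

  // No good guess: serve a proper 404 with related links instead of a dead end.
  header('HTTP/1.1 404 Not Found');
  include 'related-links.php';   // assumed template with related content (and ads)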

HTTP headers

There's more doable with creative headers than just HTTP response codes. For example, cost savings. In some cases a single line of text in an HTTP header tells the user agent more than a bunch of markup.
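One example of the "cost savings" point, sketched in PHP: answer conditional requests with a bare 304 header instead of re-sending the whole page. The timestamp source is illustrative; a real site would derive it from the content.

  <?php
  $lastModified = filemtime(__FILE__);                  // stand-in for the content's date
  $etag         = '"' . md5((string) $lastModified) . '"';

  header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
  header('ETag: ' . $etag);

  $ifModified  = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : '';
  $ifNoneMatch = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? $_SERVER['HTTP_IF_NONE_MATCH'] : '';

  if ($ifNoneMatch === $etag || ($ifModified !== '' && strtotime($ifModified) >= $lastModified)) {
      header('HTTP/1.1 304 Not Modified');              // one header line, zero body bytes
      exit;
  }

  echo 'full page markup goes here';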

Sensible page titles and summaries

There's nothing better to instantly catch the user's interest than a prominent page title, followed by a short and nicely formatted summary that the average reader can skim in a few seconds. Fast-loading and eye-catching graphics can do wonders, too. Just in case someone's scraping –er, bookmarking– my stuff, I duplicate my titles into usually invisible TITLE elements, and provide seductive calls to action in descriptive META elements. Keeping the visitor interested for more than a few seconds results in monetary opportunities.

Unique content

Writing unique, compelling, and maybe even witty product descriptions increases my sales. Those are even more attractive when I add neat stuff that's not available from the vendor's data feed (I don't mean free shipping, that's plain silly). A good (linkworthy) product page comes with practical use cases and/or otherwise well presented, not too long-winded, USPs. Producers as well as distributors suck at this task.

User generated content

Besides faked testimonials and the usual stuff, asking visitors questions like "Just in case you buy product X here, what will you actually do with it? How will you use it? Whom will you give it to?" leads to unique text snippets that create needs. Of course all user generated content gets moderated.

Ajax’ed “Buy now” widgets

Most probably, a punter who has clicked the "add to shopping cart" link on a page that nicely gathers quite a few products will not buy another one of them if the mouse click invokes a POST request to the shopping cart script that requires a full round trip to the server. Out of sight, out of mind.

Sitemaps

Both static and dynamically themed sitemap pages funnel a fair amount of visitors to appropriate landing pages. Dumping major parts of a site's structure in XML format attracts traffic, too. There's no such thing as bad traffic, just weak upselling procedures.

Ok, that's enough examples to bring my point home. Probably I've bored you to death anyway. You see, whatever I do as a site owner, I do it only for myself (inevitably accepting collateral damage like satisfied punters and compliance with search engine quality guidelines). Recap: I don't need no stinkin' advice –er, restrictions– from search engines.

Seriously, since you’re still awake and following my long-winded gobbledygook, here’s a goodie:

The Number One SEO Secret

Think like a search engine engineer. Why? Because search engines are as selfish as you are, or at least as you should be.

In order to make money from advertising, search engines need boatloads of traffic every second, 24/7/365. Since there are only so many searchers populating this planet, search engines rely on recurring traffic. Sounds like a pretty good reason to provide relevant search results, doesn’t it?

That's why search engines develop highly sophisticated algorithms that try to emulate human surfers, supposed to extract the most useful content from the Web's vast litter boxes. Their engineers tweak those algos on a daily basis, factoring in judgements of more and more signals as communities, services, opportunities, behavior, and techniques used on the Web evolve.

They try really hard to provide their users with the cream of the crop. However, SE engineers are just humans like you and me; therefore their awesome algos –as well as the underlying concepts– can fail. So don't analyze too many SERPs to figure out how an SE engineer might tick. Just be pragmatic and imagine you've got enough SE shares to temporarily throw away your Internet marketer hat.

Any time you have to make a decision on content, design, navigation or whatever, switch to search engine engineer mode and ask yourself questions like "does this enhance the visitor's surfing experience?", "would I as a user appreciate this?" or simply "would I buy this?". (Of course, instead of emulating an imaginary SE engineer, you also could switch to plain user mode. The downside of the latter brain emulation is, unfortunately, that especially geeks tend to trust the former more easily.)

Bear in mind that even if search engines don’t cover particular (optimization) techniques today, they might do so tomorrow. The same goes for newish forms of content presentation etc. Eventually search engines will find a way to work with any signal. Most of our neat little tricks bypassing today’s spam filters will stop working some day.

After completing this sanity check, heal your schizophrenia and evaluate whether it will make you money or not. After all, that's the goal.

By the way, the above doesn't mean that only so-called 'purely white hat' traffic optimization techniques work with search engines. Actually, SEO = hex'53454F', and as an RGB value that's a pretty dark gray. :-)

Related thoughts: Optimize for user experience, but keep the engines in mind. Misleading some folks, just like this pamphlet.




Debugging robots.txt with Google Webmaster Tools

Although Google’s Webmaster Console is a really neat toolkit, it can mislead the not-that-savvy crowd every once in a while.

When you go to “Diagnostics::Crawl Errors::Restricted by robots.txt” and you find URIs that aren’t disallow’ed or even noindex’ed in your very own robots.txt, calm down.

Google's cool robots.txt validator keeps its knowledge of redirects to itself and approves your redirecting URIs, driving you nuts until you check each URI's HTTP response code for redirects (HTTP response codes 301, 302 and 307, as well as undelayed meta refreshes).

Google obeys robots.txt even in a chain of redirects. If, for Google's user agent(s), an URI given in an HTTP header's Location field is disallow'ed or noindex'ed, Googlebot doesn't fetch it, regardless of its position in the current chain of redirects. Even a robots.txt block in the 5th hop stops the greedy Web robot. Those URIs are correctly reported back as "restricted by robots.txt"; Google just refuses to tell you that the blocking crawler directive originates from a foreign server.
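If you want to see where the block comes from, a quick-and-dirty script like this helps: follow the redirect chain hop by hop and dump each host's Disallow lines. It's simplified matching against a made-up example URI, not a full REP parser.

  <?php
  // Which hop's robots.txt is blocking Googlebot? Walk the chain and look.
  function robots_txt_for($uri) {
      $p = parse_url($uri);
      return @file_get_contents($p['scheme'] . '://' . $p['host'] . '/robots.txt');
  }

  $uri = 'http://example.com/redirecting-page';   // the URI GWC complains about
  for ($hop = 1; $hop <= 5; $hop++) {
      $ch = curl_init($uri);
      curl_setopt_array($ch, array(
          CURLOPT_NOBODY         => true,
          CURLOPT_HEADER         => true,
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_FOLLOWLOCATION => false,
      ));
      curl_exec($ch);
      $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
      $next   = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
      curl_close($ch);

      echo "Hop $hop: $uri ($status)\n";
      $robots = robots_txt_for($uri);
      if ($robots !== false) {
          foreach (explode("\n", $robots) as $line) {
              if (stripos(trim($line), 'Disallow:') === 0) {
                  echo "  " . trim($line) . "\n";   // check these against the URI's path
              }
          }
      }
      if ($status < 300 || $status >= 400 || !$next) {
          break;                                    // end of the redirect chain
      }
      $uri = $next;
  }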




Getting new stuff crawled in real-time

Thanks to PubSubHubbub we’ve got another method to invite Googlebot.

Just add a link to FriendFeed, for example by twittering or bookmarking it. Once the URI hits your FriendFeed account, FriendFeed shares it with Googlebot, and both of them request it before the tweet appears on your timeline:

Date/Time Request URI IP User agent
2009-08-17 15:47:19 Your URI 38.99.68.206 / 38.99.68.206 Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)
2009-08-17 15:47:19 Your robots.txt 66.249.71.141 / crawl-66-249-71-141.googlebot.com Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2009-08-17 15:47:19 Your URI 66.249.71.141 / crawl-66-249-71-141.googlebot.com Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

 

Neat. Of course there are other methods to invite search engine crawlers at lightning speed, but this one comes in handy because it can run on autopilot […]. Even when the actual fetch is meant to feed Google Reader or something, you've got your stuff into Google's crawling cache, from where other services like the Web indexer can pick it up without requesting a crawl.
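For the autopilot part, here's a minimal sketch of a PubSubHubbub publish ping in PHP; the hub and feed URLs are examples (Google's reference hub used to live at pubsubhubbub.appspot.com).

  <?php
  // Tell the hub that the feed has new content, so subscribers (including
  // crawlers) get pushed an update instead of polling.
  $hub  = 'http://pubsubhubbub.appspot.com/';
  $feed = 'http://example.com/blog/feed/';   // the topic URL declared in your feed

  $ch = curl_init($hub);
  curl_setopt_array($ch, array(
      CURLOPT_POST           => true,
      CURLOPT_POSTFIELDS     => http_build_query(array(
          'hub.mode' => 'publish',
          'hub.url'  => $feed,
      )),
      CURLOPT_RETURNTRANSFER => true,
  ));
  curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);

  // The hub answers 204 No Content when it accepts the ping.
  echo $status === 204 ? "Ping accepted\n" : "Ping failed ($status)\n";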



