Does your built-in bullshit detector cry in agony when you read announcements of link analysis tools claiming to have crawled Web pages in the trillions? Can a tiny SEO shop, or a remote search engine in its early stages running on donated equipment, build an index of that size? It took Google a decade to reach these figures, and Google’s webspam team alone outnumbers the staff of SEOmoz and Majestic, not to speak of infrastructure.
Well, it’s not as shady as you might think, although there’s some serious bragging and willy whacking involved.
First of all, neither SEOmoz nor Majestic owns an indexed copy of the Web. They process markup just to extract hyperlinks. That means they parse Web resources, mostly HTML pages, to store linkage data. Once each link and its attributes (HREF and REL values, anchor text, …) are stored under a Web page’s URI, the markup gets discarded. That’s why you can’t search these indexes for keywords. No full text index is necessary to compute link graphs.
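To make the "extract links, then throw the markup away" idea concrete, here is a minimal sketch using Python's standard `html.parser`. The page URI, the sample markup and the `link_graph` dictionary are made up for illustration; this is not how either vendor actually implements it.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href, rel and anchor text for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished link records
        self._current = None  # link whose anchor text is being collected

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            if attributes.get("href"):
                self._current = {
                    "href": attributes["href"],
                    "rel": attributes.get("rel") or "",
                    "anchor": "",
                }

    def handle_data(self, data):
        # Text between <a> and </a> is the anchor text.
        if self._current is not None:
            self._current["anchor"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["anchor"] = self._current["anchor"].strip()
            self.links.append(self._current)
            self._current = None


def extract_links(markup):
    """Return the link records found in a chunk of HTML."""
    parser = LinkExtractor()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical page: only the link records are kept, the markup is dropped.
link_graph = {}
page_uri = "http://example.com/"
page_markup = (
    "<html><body><p>Some text nobody indexes.</p>"
    '<a href="http://example.org/" rel="nofollow">Example Org</a>'
    '<a href="/about">About</a></body></html>'
)
link_graph[page_uri] = extract_links(page_markup)
```

The paragraph text never makes it into `link_graph`, which is exactly why such a store can answer "who links to whom" but not a keyword query.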
The storage requirements for the Web’s link graph are way smaller than those of the full text indexes that major search engines have to handle. In other words, it’s plausible.
Majestic clearly describes this process, and openly states that they index only links.
With SEOmoz that’s a completely different story. They obfuscate information about the technology behind LinkScape to a level that could be described as near-snake-oil. Of course one could argue that they might be totally clueless, but I don’t buy that. You can’t create a tool like LinkScape if you’re a moron with an IQ slightly below an amoeba’s. As a matter of fact, I do know that LinkScape was developed by extremely bright folks, so we’re dealing with a misleading sales pitch:
Let’s throw in a comment at Sphinn, where a SEOmoz rep posted “Our bots, our crawl, our index”.
Of course that’s utter bullshit. SEOmoz does not have the resources to accomplish such a task. In other words, if –and that’s a big IF– they do work as described above, they’re operating something extremely sneaky that breaks Web standards and my understanding of fairness and honesty. Actually, that’s not so, but because it is not so, LinkScape and OpenSiteExplorer in its current shape must die (see below why).
They do insult your intelligence as well as mine, and that’s obviously not the right thing to do, but I assume they do it solely for marketing purposes. Not that they need to cover up their operation with a smokescreen like that. LinkScape could succeed with all facts on the table. I’d call it a neat SEO tool, if only it were legit.
So what’s wrong with SEOmoz’s statements above, and with LinkScape in general?
Let’s start with “Crawled in the past 45 days: 700 billion links, 55 billion URLs, 63 million root domains”. That translates to “crawled … 55 billion Web pages, including 63 million root index pages, carrying 700 billion links”. 13 links per page is plausible. Crawling 55 billion URIs requires sending out HTTP GET requests to fetch 55 billion Web resources within 45 days; that’s roughly 30 terabytes per day. Plausible? Perhaps.
True? Not as is. Making up numbers like “crawled 700 billion links” suggests a comprehensive index of 700 billion URIs. I highly doubt that SEOmoz did ‘crawl’ 700 billion URIs.
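The back-of-the-envelope numbers above are easy to check. The sketch below uses only the figures quoted in this post; the 30 terabytes per day is my rough estimate, and the average resource size is simply what that estimate implies, not a measured value.

```python
# Figures quoted from the announcement.
LINKS = 700e9   # "700 billion links"
URLS = 55e9     # "55 billion URLs"
DAYS = 45       # "crawled in the past 45 days"

# Rough estimate from this post (an assumption, not a measurement).
TERABYTES_PER_DAY = 30

SECONDS_PER_DAY = 24 * 60 * 60

links_per_page = LINKS / URLS                         # about 12.7
fetches_per_day = URLS / DAYS                         # about 1.2 billion
fetches_per_second = fetches_per_day / SECONDS_PER_DAY

# Average resource size implied by 30 TB/day (decimal units).
implied_kb_per_resource = TERABYTES_PER_DAY * 1e12 / fetches_per_day / 1e3

print(f"links per page:          {links_per_page:.1f}")
print(f"fetches per day:         {fetches_per_day:,.0f}")
print(f"fetches per second:      {fetches_per_second:,.0f}")
print(f"implied KB per resource: {implied_kb_per_resource:.1f}")
```

That works out to more than fourteen thousand GET requests per second, around the clock, for 45 days straight. Nobody runs that without showing up in server logs all over the Web.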
If SEOmoz really crawled the Web, they’d have to respect Web standards like the Robots Exclusion Protocol (REP). You would find their crawler in your logs. An organization crawling the Web must
- do that with a user agent that identifies itself as a crawler, for example “Mozilla/5.0 (compatible; Seomozbot/1.0; +http://www.seomoz.com/bot.html)”,
- fetch robots.txt at least daily,
- provide a method to block their crawler with robots.txt,
- respect indexer directives like “noindex” or “nofollow” both in META elements and in HTTP response headers.
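A well-behaved crawler would run checks like the following before fetching and before indexing. This is a minimal sketch with Python's standard `urllib.robotparser`; the `Seomozbot` token is the hypothetical crawler name from the example above, and the robots.txt content is made up. The header helper only handles the plain comma-separated form of X-Robots-Tag and ignores directives prefixed with a user agent name.

```python
from urllib.robotparser import RobotFileParser

# robotparser matches on the short product token, not on the full
# "Mozilla/5.0 (compatible; ...)" string, so pass the token.
CRAWLER_TOKEN = "Seomozbot"  # hypothetical crawler name

# Made-up robots.txt: lock out one crawler, restrict everyone else a bit.
ROBOTS_TXT = """\
User-agent: Seomozbot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent_token, url):
    """True if robots.txt allows this crawler to request the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent_token, url)


def robots_directives(header_value):
    """Split a plain X-Robots-Tag value into lowercase directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


blocked = not may_fetch(ROBOTS_TXT, CRAWLER_TOKEN, "http://example.com/page.html")
others_ok = may_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/page.html")
skip_indexing = "noindex" in robots_directives("noindex, nofollow")
```

With this robots.txt the hypothetical Seomozbot is locked out of everything, while other crawlers only lose `/private/`. That is the kind of control a site owner gets when a service actually crawls under its own name.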
SEOmoz obeys only
<META NAME="SEOMOZ" CONTENT="NOINDEX" />, according to their sources page. And exactly this page reveals that they purchase their data from various services, including search engines. They do not crawl a single Web page.
Savvy SEOs should know that crawling, parsing, and indexing are different processes. Why does SEOmoz insist on the term “crawling”, taking all the flak they can get, when they obviously don’t crawl anything?
Two claims out of three in “Our bots, our crawl, our index” are blatant lies. If SEOmoz performs any crawling, in addition to processing bought data, without following and communicating the procedure outlined above, that would be sneaky. I really hope that’s not happening.
As a matter of fact, I’d like to see SEOmoz crawling. I’d be very, very happy if they did not purchase a single byte of 3rd party crawler results. Why? Because I could block them in robots.txt. If they don’t access my content, I don’t have to worry whether they obey my indexer directives (robots meta ‘tag’) or not.
As a side note, requiring a “SEOMOZ” robots META element to opt out of their link analysis is plain theft. Adding such code bloat to my pages takes a lot of time, and that’s expensive. Also, serving an additional line of code in each and every HEAD section sums up to a lot of wasted bandwidth –$$!– over time. Am I supposed to invest my hard earned bucks just to avoid revealing my outgoing links to my competitors? For that reason alone I should report SEOmoz to the FTC and ask them to shut LinkScape down asap.
They don’t obey the X-Robots-Tag (“noindex”/“nofollow”/… in the HTTP header) for a reason. Working with purchased data from various sources they can’t guarantee that they even get those headers. Also, why the fuck should I serve MSNbot, Slurp or Googlebot an HTTP header addressing SEOmoz? This could put my search engine visibility at risk.
If they crawled the Web themselves, serving their user agent a “noindex” X-Robots-Tag and a 403 might be doable, at least if they paid for my efforts. With their current setup that’s technically impossible. They could switch to 80legs.com completely; that would solve the problem, provided 80legs works 100% by the REP and crawls as “SEOmozBot” or so.
With MajesticSEO that’s not an issue, because I can block their crawler with
Yahoo’s site explorer also delivers too much data. I can’t block it without losing search engine traffic. Since it will probably die when Microsoft takes over search.yahoo.com, I don’t rant much about it. Google and Bing don’t reveal my linkage data to everyone.
I have an issue with SEOmoz’s LinkScape, and OpenSiteExplorer as well. It’s serious enough that I say they have to close it, if they’re not willing to change their architecture. And that has nothing to do with misleading sales pitches, or arrogant behavior, or sympathy (or a possible lack thereof).
The competitive link analysis OpenSiteExplorer/LinkScape provides, without giving me a real chance to opt out, puts my business at risk. As much as I appreciate an opportunity to analyze my competitors, vice versa it’s downright evil. Hence just kill it.
Is my take too extreme? Please enlighten me in the comments.
Update: A follow-up post from Michael VanDeMar and its Sphinn discussion, the first LinkScape thread at Sphinn, and Sphinn comments to this pamphlet.