Crawling vs. Indexing

Sigh. I just have to throw in my 2 cents.

Crawling means sucking up content without processing the results. Crawlers are rather dumb processes: they send (HTTP) requests for URIs, take whatever content the Web servers deliver, and hand it off to other processes, e.g. crawling caches or directly to indexers. Crawlers get their URIs from a crawling engine that's fed from different sources, including links extracted from previously crawled Web documents, URI submissions, foreign Web indexes, and whatnot.
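A minimal sketch of that division of labor (the `fetch` callable and the queue are hypothetical stand-ins, not any particular engine's API): the crawler only fetches, hands content off, and queues newly discovered URIs; it does no content analysis.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags; extracted links feed the crawling engine's queue."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_uris, fetch, max_fetches=100):
    """Dumb crawler loop: fetch, deliver, queue. No indexing happens here."""
    queue, seen, fetched = deque(seed_uris), set(seed_uris), []
    while queue and len(fetched) < max_fetches:
        uri = queue.popleft()
        body = fetch(uri)              # an HTTP GET in a real crawler
        fetched.append((uri, body))    # delivered to a crawling cache or indexer
        extractor = LinkExtractor()
        extractor.feed(body)
        for link in extractor.links:   # newly discovered URIs go back to the engine
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

With a stubbed `fetch` (e.g. a dict lookup instead of a network request), `crawl(["/a"], fetch)` walks the link graph breadth-first without ever looking at what the content means.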

Indexing means making sense of the retrieved contents and storing the processing results in a (more or less complex) document index. Link analysis is a way to measure a URI's importance, popularity, trustworthiness and so on. Link analysis is often just a helper within the indexing process, sometimes an end in itself, but traditionally a task of the indexer, not the crawler (highly sophisticated crawling engines do use link data to steer their crawlers, but that has nothing to do with link analysis in document indexes).
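A toy illustration of the indexer's side of the fence (hypothetical structures, nothing like a production index): the retrieved text goes into an inverted index, and link analysis is reduced here to naive in-link counting as a stand-in for importance.

```python
from collections import defaultdict

def build_index(documents):
    """documents: {uri: text}. Returns an inverted index mapping term -> set of URIs."""
    index = defaultdict(set)
    for uri, text in documents.items():
        for term in text.lower().split():
            index[term].add(uri)
    return index

def inlink_counts(link_graph):
    """link_graph: {uri: [outgoing link targets]}. Naive popularity: count incoming links."""
    counts = defaultdict(int)
    for source, targets in link_graph.items():
        for target in targets:
            counts[target] += 1
    return counts
```

Both steps operate on content and links a crawler already delivered; neither function fetches anything.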

A crawler directive like “disallow” in robots.txt can direct crawlers, but means nothing to indexers.
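Python's standard library ships a parser for exactly this directive; here it evaluates a made-up robots.txt against a hypothetical "SEOservice" user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bans "SEOservice" everywhere, others only under /private/
robots_txt = """\
User-agent: SEOservice
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("SEOservice", "http://example.com/page"))    # False
print(parser.can_fetch("somebot", "http://example.com/page"))       # True
print(parser.can_fetch("somebot", "http://example.com/private/x"))  # False
```

Note that `can_fetch` answers the crawler's question only: whether the URI may be requested at all. It says nothing about whether an already-fetched document may be indexed.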

An indexer directive like "noindex" in an HTTP header (X-Robots-Tag), in an HTML document's HEAD section, or even in a robots.txt file, can direct indexers, but means nothing to crawlers: a crawler must fetch the document in the first place so that the indexer can obey those (inline) directives.
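A sketch of how an indexer (not a crawler) might honor such directives once the document has already been fetched; the header and meta checks below are simplified assumptions, not any real engine's logic.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Looks for <meta name="robots" content="...noindex..."> in the document."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == "robots" and "noindex" in a.get("content", "").lower():
                self.noindex = True

def may_index(headers, html):
    """headers: dict of HTTP response headers. Returns False if a noindex directive applies."""
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return False  # HTTP-header variant: works for non-HTML resources too
    finder = RobotsMetaFinder()
    finder.feed(html)
    return not finder.noindex
```

The crucial point survives in the function's signature: `may_index` needs the response headers and the document body as input, which means the fetch has already happened by the time the directive can be obeyed.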

So when a Web service offers an indexer directive like <meta name="SEOservice" content="noindex" /> to keep particular content out of its index, but doesn't offer a crawler directive like "User-agent: SEOservice" / "Disallow: /", this Web service doesn't crawl.

That’s not about semantics, that’s about Web standards.

Whether or not such a Web service can come with incredible values squeezed out of its index gathered elsewhere, without crawling the Web itself, is a completely different story.




15 Comments to "Crawling vs. Indexing"

  1. Sebastian on 20 October, 2008  #link

    BTW, Donna’s statement is spot on. I feel misled myself, although I don’t accuse Rand of intentional obfuscation. I just thought setting the record straight could be a good idea (solely WRT the post’s title!). My interpretation of all the Web standards involved supports Michael’s point of view. Also, I do think that the noindex directive is way too slow a mechanism to let site owners protect their link data properly. There’s a lot of work to do for Rand’s team.

    BTW, I’ve no personal agenda WRT to Rand or Rand’s products, but I do think the launch was done somewhat unhandily. However: folks, please keep the discussion as polite as you can. Thanks.

  2. Hobo on 20 October, 2008  #link

    It’s usually worth more than 2c, Seb. :)

  3. Sebastian on 20 October, 2008  #link

    Thx :) What about that: http://sphinn.com/story/80142#c56109 ? Could such a near real time opt-out option satisfy everyone?

  4. Michael VanDeMar on 20 October, 2008  #link

    Nice and concise, thanks Sebastian. :)
    As to the bots being blocked… while I agree that is a very important issue, for me the much bigger one is the misrepresentation. It’s simply something I am not cool with.

  5. Doug Heil on 21 October, 2008  #link

    Excellent stuff Sebastian. You continually amaze me at how well you can clearly state your thoughts. Wish I had that talent. :)

    But yeah; the real issue is the deception and lying that occurred at launch. The issue is the many edits of pages on the moz site after he was called out on things. The issue is the bogus way we all can opt out of the tool… it won’t work, as links are links and no meta tag will stop this. The issue is that many “not in the know” SEOs will still think it’s a great thing, even though it isn’t. The REAL issue is a prominent SEO firm that strives for best practices and presents itself that way by speaking at conferences and making like it’s doing good, when in fact it’s all about the money for investors and doing what they can to cover up their blatant attempts to make this tool seem greater than it really is.

    oops; sorry Sebastian; I know I can rant. lol Your post and definitions are spot on. If only the same investors had come to me, as I have so many ideas that might “right” this industry, but not the money to make them a reality.

  6. Sebastian on 21 October, 2008  #link

    Doug,

    thanks for the compliment. In the sense of my statement above I had to blur your rant. Everything you said can be found in endless variations at Sphinn and various blogs discussing the LinkScape launch. I don’t think that it’s helpful to spread repetitive rants. Rand has no chance to reply to every post on gazillions of SEO hangouts, and that doesn’t fit my understanding of fairness in a professional discussion.

    I want to keep the comments to this post on topic. The point is crawling vs. indexing, that is, whether or not it’s possible to opt out of possible crawls that feed only the LinkScape index, and whether the method(s) given to Webmasters are reasonable. BTW, Rand has announced that his team will apply changes to the current system by the end of this year. From a developer’s POV this timeline is reasonable, although I do think that this functionality should have been an essential part of the concept, delivered with the very first product release. For example, a refetch of outdated robots.txt files to check for (new) exclusionary statements before an index update should be doable in a system of that size.

    Thanks for your understanding.
    Sebastian

  7. Doug Heil on 21 October, 2008  #link

    Yep.

    @Sebastian; you wrote:
    “and whether or not the method(s) given to Webmasters are reasonable or not”

    Any method that amounts to advertising SEOmoz in every page’s HEAD tag is not reasonable at all. It’s a “linkage” tool, so meta tag blocking is not possible. He knows that.

    Besides all of that; did you see his latest statement about his tool and methods? He stated he runs a rogue bot and there’s not much anyone can do about it. I’m paraphrasing.

    But anyway; he chooses to run his firm in this fashion, so we can choose how we wish to continue to view his firm. The industry will decide as a whole.

    Again Sebastian; thanks for putting the definitions out there extremely concise and clear.

  8. Sebastian on 21 October, 2008  #link

    The meta tag method is nice to have, but not practical, IOW more or less propaganda. I’m not excited to add a shitload of code bloat to the HEAD section of all my pages every time someone releases a new tool. Actually, I won’t do it except for indexing purposes addressing major search engines, and I’m in good company.

    Frankly, I don’t care about the data they publish; those are available elsewhere for free. Technically speaking, the offered on-page ‘noindex’ directive offers a way to opt out partially, but that’s IMHO just a legal thingy. BTW, it’s the same legal butt covering that all major engines have practiced since Web search began, without much whining across the boards. So that’s not the point.

    The point is that an SEO company should be way more sensitive than a search engine when it comes to Webmaster concerns. Part of the canonical way to launch such SEO tools is a simple procedure to opt out without hassles, and in a timely manner. Probably very few folks would actually opt out when such an option comes with the first product release. Now every dog and its fleas, well, flee.

  9. Doug Heil on 21 October, 2008  #link

    Goodness; is that so very true!

    I couldn’t give a rat’s arse what kind of data the tool shows for me OR my clients, but that is not the point at all. It’s one SEO’s opinion about link data anyway, so I don’t care about that. If the firm had been totally honest and not deceitful at all about what they were doing and had been doing for quite a while now, not much would have been said, other than the fact that the tool is really a rogue scraper. The fact that his firm promotes itself to the SEO industry as do-gooders committed to best practices makes what he has done, and how they did it, all the more pathetic.

    [Again, off-topic rants blurred …]

  10. […] not SEOmoz has their own crawler/spider. All the word play is confusing. If you haven’t yet, this explanation of Crawling vs. Indexing clarifies the terms […]

  11. […] to do it properly. Note, however, and I don’t care what anyone else has told you…. Your robots.txt WILL NOT prevent indexing or remove indexed pages. To achieve this you will need a noindex tag in the page in […]

  12. […] I appreciate Google’s brand new News User Agent. It is, however, not a perfect solution, because it doesn’t distinguish indexing and crawling. […]

  13. […] 63 million root index pages, carrying 700 billion links”. 13 links per page is plausible. Crawling 55 billion URIs requires sending out HTTP GET requests to fetch 55 billion Web resources within 45 […]

  14. […] them from appearing in Google’s search results, then you need to study the difference between crawling and indexing. The only thing I found interesting about this particular SERP listing is the title. As far as I […]

  15. eUKhost on 23 November, 2011  #link

    Thanks for explaining the difference between Crawling and Indexing nicely. The way you have explained it is very easy and helpful.
