Monthly archive: April, 2007

Is XML Sitemap Autodiscovery for Everyone?

Google recently implemented referencing XML sitemaps in robots.txt, upon requests from webmasters going back to June 2005, shortly after the initial launch of sitemaps. Yahoo, Microsoft, and Ask support it too, although nobody knows when MSN is going to implement XML sitemaps at all.

Some folks argue that robots.txt, introduced by the Robots Exclusion Protocol from 1994, should not get abused by inclusion mechanisms. Indeed this may create confusion, but it was done before, for example by search engines supporting Allow: statements, introduced in 1996. Also, the de facto Robots Exclusion Standard covers robots meta tags (where inclusion is the default) too. I think dogmatism is not helpful when actual needs require evolution.

So yes, the opportunity to address sitemaps in robots.txt is a good thing, but certainly not enough. It simplifies the process, that is, autodetection of sitemaps eliminates a few points of failure. Webmasters don’t need to monitor which engines implemented the sitemaps protocol recently and submit accordingly. They can just add a single line to their robots.txt file and the engines will do their job. Fire and forget is a good concept. However, the good news comes with pitfalls.

But is this good thing actually good for everyone? Not really. Many publishers have no control over their server’s robots.txt file, for example publishers using signup-and-instantly-start-blogging services or free hosts. Even though these platforms generate RSS feeds or other URL lists suitable as sitemaps, the publishers must submit them to all search engines manually. Enhancing sitemap autodetection by looking at page meta data would be great: <meta name="sitemap" content="http://www.example.com/sitemap.xml" /> or <link rel="sitemap" type="application/rss+xml" href="http://www.example.com/sitefeed.rss" /> would suffice.
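
To make the idea tangible, here is a minimal Python sketch of how a crawler could pick up such hints from a page. Keep in mind that both tags are just my proposal, no engine supports them, and the names SitemapHintParser and find_sitemap_hints are made up for this illustration.

```python
from html.parser import HTMLParser


class SitemapHintParser(HTMLParser):
    """Collects sitemap URLs from the two proposed (non-standard) tags."""

    def __init__(self):
        super().__init__()
        self.sitemaps = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Proposed: <meta name="sitemap" content="...">
        if tag == "meta" and (attr.get("name") or "").lower() == "sitemap":
            if attr.get("content"):
                self.sitemaps.append(attr["content"])
        # Proposed: <link rel="sitemap" href="...">
        elif tag == "link" and "sitemap" in (attr.get("rel") or "").lower().split():
            if attr.get("href"):
                self.sitemaps.append(attr["href"])


def find_sitemap_hints(html):
    """Return all sitemap URLs hinted at in the given HTML, in document order."""
    parser = SitemapHintParser()
    parser.feed(html)
    parser.close()
    return parser.sitemaps
```

A hosted blog that can only touch its page templates could announce its feed this way, without any access to robots.txt.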

So much for those explicitly shut out. Others are barred from sitemap autodiscovery by lack of experience, technical skills, or manageable environments, as at way too restrictive hosting services. Example: the prerequisites for sitemap autodetection include the ability to fix canonical issues. An XML sitemap containing www.domain.tld URLs, referenced as Sitemap: http://www.domain.tld/sitemap.xml in http://domain.tld/robots.txt, is plain invalid. Crawlers following links without the “www” subdomain will request the robots.txt file without the “www” prefix. If a webmaster running this flawed but very common setup relies on sitemap autodetection, s/he will miss out on feedback and error alerts. On some misconfigured servers this may even lead to deindexing of all pages with relative internal links.
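
The canonical mismatch described above is easy to spot with a few lines of code. The following Python sketch compares the host that served robots.txt with the hosts in its Sitemap: lines. The function names are my own, and the parsing is deliberately simplistic (one directive per line, comment lines skipped), so take it as an illustration rather than a validator.

```python
from urllib.parse import urlsplit


def sitemap_lines(robots_txt):
    """Return the URLs given in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def host_mismatches(robots_url, robots_txt):
    """List sitemap URLs whose host differs from the host serving robots.txt."""
    robots_host = (urlsplit(robots_url).hostname or "").lower()
    return [url for url in sitemap_lines(robots_txt)
            if (urlsplit(url).hostname or "").lower() != robots_host]
```

Fed with http://domain.tld/robots.txt and a Sitemap: line pointing to www.domain.tld, the check reports that sitemap URL. Fed with the www version of the robots.txt URL, it reports nothing.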

Hence please listen to Vanessa Fox stating that webmasters should register their autodiscovered sitemaps at Webmaster Central and Site Explorer to get alerted on errors which an XML sitemap validator cannot spot, and to monitor the crawling process!

I doubt many SEO professionals and highly skilled Webmasters managing complex sites will make use of that new feature. They prefer to have things under control, and automated 3rd party polls are hard to manipulate. Probably they want to maintain different sitemaps per engine to steer their crawling accordingly. Although this can be accomplished by user agent based delivery of robots.txt, that additional complexity doesn’t make the submission process easier to handle. Only uber-geeks automate everything ;)
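
For the uber-geeks, user agent based delivery of robots.txt could look roughly like the Python sketch below. The sitemap URLs and the mapping are hypothetical examples, and the substring match on the user agent is an assumption of mine that does nothing to verify that a request really comes from the crawler it claims to be.

```python
# Hypothetical mapping of crawler user-agent tokens to per-engine sitemaps.
SITEMAPS_BY_CRAWLER = {
    "googlebot": "http://www.example.com/sitemap-google.xml",
    "slurp": "http://www.example.com/sitemap-yahoo.xml",
}
# Hypothetical fallback for all other user agents.
DEFAULT_SITEMAP = "http://www.example.com/sitemap.xml"


def robots_txt_for(user_agent):
    """Return a robots.txt body whose Sitemap: line depends on the user agent."""
    ua = (user_agent or "").lower()
    sitemap = DEFAULT_SITEMAP
    for token, url in SITEMAPS_BY_CRAWLER.items():
        if token in ua:
            sitemap = url
            break
    return "User-agent: *\nDisallow:\n\nSitemap: " + sitemap + "\n"
```

Every request for /robots.txt would have to run through such a handler, which is exactly the additional complexity mentioned above.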

For example, it makes no sense to present a gazillion image or video clip URLs to a search engine indexing textual content only. Google makes handling different content types extremely simple for the site owner. One can put HTML pages, images, movies, PDFs, feeds, office documents and whatever else all in one sitemap and Google’s sophisticated crawling process delivers each URL to the indexer it belongs to. We don’t know (yet) how other engines will handle that.

Also, XML sitemaps are a neat instrument to improve crawling and indexing of particular contents. One search engine may nicely index insufficiently linked stuff, whilst another engine fails to discover pages buried more than two link levels deep, badly needing the hints via sitemap. There are more good reasons to give each engine its own sitemap.

Last but not least there might be good reasons not to announce sitemap contents to the competition.

Tags: Search Engine Optimization (SEO) XML Sitemap Auto-discovery




In need of a "Web-Robot Directives Standard"

The Robots Exclusion Protocol from 1994 gets used and abused, best described by Lisa Barone citing Dan Crow from Google: “everyone uses it but everyone uses a different version of it”. De facto we’ve a Robots Exclusion Standard covering crawler directives in robots.txt and robots meta tags as well, said Dan Crow. Besides non-standardized directives like “Allow:”, Google’s Sitemaps Protocol adds more inclusion to the mix, now even closely bundled with robots.txt. There are more ways to place crawler directives: unstructured (in the sense of independence from markup elements) as with Google’s section targeting, on link level by applying the commonly disliked rel-nofollow microformat or XFN, and related thoughts on block-level directives.

All in all that’s a pretty confusing conglomerate of inclusion and exclusion, utilizing many formats and markup elements, and lots of places to put crawler directives. Not really the sort of norm the webmaster community can successfully work with. No wonder that over 75,000 robots.txt files have pictures in them, that less than 35 percent of servers have a robots.txt file, that the average robots.txt file is 23 characters long (“User-agent: * Disallow:”), and that gazillions of Web pages carry useless and unsupported meta tags like “revisit-after” … for more funny stats and valuable information see Lisa’s robots.txt summit coverage (SES NY 2007), also covered by Tamar (read both!).

How to structure a “Web-Robot Directives Standard”?

To handle redundancies as well as cascading directives properly, we need a clear and understandable chain of command. The following is just a first idea off the top of my head, and will likely get updated soon:

  • Robots.txt

    1. Disallows directories, files/file types, and URI fragments like query string variables/values by user agent.
    2. Allows sub-directories, file names and URI fragments to refine Disallow statements.
    3. Gives general directives like crawl-frequency or volume per day and maybe even folders, and restricts crawling in particular time frames.
    4. References general XML sitemaps accessible to all user agents, and specific XML sitemaps addressing particular user agents as well.
    5. Sets site-level directives like “noodp” or “noydir”.
    6. Predefines page-level instructions like “nofollow”, “nosnippet” or “noarchive” by directory, document type or URL fragments.
    7. Predefines block-level or element-level conditions like “noindex” or “nofollow” on class names or DOM-IDs by markup element. For example “DIV.hMenu,TD#bNav ‘noindex,nofollow’” could instruct crawlers to ignore the horizontal menu as well as navigation at the very bottom on all pages.
    8. Predefines attribute-level conditions like “nofollow” on A elements. For example “A.advertising REL ‘nofollow’” could tell crawlers to ignore links in ads, or “P#tos > A ‘nofollow’” could instruct spiders to ignore links in TOS excerpts found on every page in a P element with the DOM-ID “tos”.
  • XML Sitemaps

    1. Since robots.txt deals with inclusion now, why not add an optional URL-specific “action” element allowing directives like “nocache” or “nofollow”? Also a “delete” directive to get outdated pages removed from search indexes would make sound sense.
    2. To make XML sitemap data reusable, and to allow centralized maintenance of page meta data, a couple of new optional URL elements like “title”, “description”, “document type”, “language”, “charset”, “parent” and so on would be a neat addition. This way it would be possible to visualize XML sitemaps as native (and even hierarchical) site maps.

    Robots.txt exclusions overrule URLs listed for inclusion in XML sitemaps.

  • Meta Tags

    Page meta data overrule directives and information provided in robots.txt and XML sitemaps. Empty contents in meta tags suppress directives and values given in upper levels. Non-existent meta tags implicitly apply data and instructions from upper levels. The same goes for everything below.

  • Body Sections

    Unstructured parenthesizing of parts of code certainly is undoable with XMLish documents, but may be a pragmatic procedure to deal with legacy code. Paydirt in HTML comments may be allowed to mark payload for contextual advertising purposes, but it’s hard to standardize. Let’s leave that for proprietary usage.

  • Body Elements

    Implementing a new attribute for messages to machines should be avoided for several good reasons. Classes are additive, so multiple values can be specified for most elements. That would allow putting standardized directives in class names, for example class="menu robots-noindex googlebot-nofollow slurp-index-follow" where the first class addresses CSS. Such inline robot directives come with the same disadvantages as inline style assignments and open a can of worms, so to say. Using classes and DOM-IDs just as a reference to user agent specific instructions given in robots.txt is surely the preferable procedure.

  • Element Attributes

    More or less this level is a playground for microformats utilizing the A element’s REV and REL attributes.
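
The chain of command outlined above boils down to a simple cascade, sketched here in Python. This is only my reading of the rules for one page: a level that says nothing inherits, a level with empty contents suppresses what came from above, and a level that states directives replaces them. Whether a lower level should replace or merge with the upper ones is an open question, and effective_directives is a made-up name for a standard that doesn’t exist yet.

```python
def effective_directives(levels):
    """Resolve directives for one page from levels ordered top to bottom.

    The order is robots.txt, XML sitemap, meta tags, and so on. Each level
    is None (nothing stated: inherit), an empty set (explicitly empty:
    suppress inherited directives) or a set of directive names.
    """
    result = set()
    for level in levels:
        if level is None:
            continue             # non-existent: upper levels still apply
        result = set(level)      # stated: overrules upper levels (assumption)
    return frozenset(result)
```

For example, a page-level “noarchive” predefined in robots.txt survives when sitemap and meta tags say nothing, vanishes when the robots meta tag is present but empty, and gets replaced when the meta tag states “nofollow”.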

Besides the common values “nofollow”, “noindex”, “noarchive”/“nocache” etc. and their omissible positive defaults “follow” and “index” etc., we’d need a couple more, for example “unapproved”, “untrusted”, “ignore” or “skip” and so on. There’s a lot of work to do.

In terms of complexity, a mechanism as outlined above should be as easy to use as CSS in combination with client-side scripting for visualization purposes.

However, whatever better ideas are out there, we need a widely accepted “Web-Robot Directives Standard” as soon as possible.

Tags: Search Engine Optimization (SEO) XML Sitemaps Meta Tags Crawler Directives




XML sitemap auto-discovery

Vanessa Fox makes the news: In addition to sitemap submissions you can add this line to your robots.txt file:

Sitemap: http://www.example.com/sitemap.xml

Google, Yahoo, MSN and Ask (new on board) then should fetch and parse the XML sitemap automatically. Next week or so the cool robots.txt validator will get an update too.
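
If you want to see what a crawler gets out of that line, Python’s standard library can read it. The sketch below uses urllib.robotparser, whose site_maps() method needs Python 3.8 or later. The robots.txt body and the function name discover_sitemaps are just examples, and of course this says nothing about how the engines implement it internally.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body containing the autodiscovery line from above.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""


def discover_sitemaps(robots_txt):
    """Return the sitemap URLs announced in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []
```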

Question: Is XML Sitemap Autodiscovery for Everyone?

Update:
More info here.

Q: Does it work by user-agent?
A: Yes, add all sitemaps to robots.txt, then disallow by engine.

Q: Must I fix canonical issues before I use sitemap autodiscovery?
A: Yes, 301-redirect everything to your canonical server name, and choose a preferred domain at Webmaster Central.

Q: Can I submit all supported sitemap formats via robots.txt?
A: Yes, everything goes. XML, RSS, Atom, plain text, gzipped …
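
The answer to the user-agent question can be sketched as a tiny robots.txt generator in Python. It lists all sitemaps and, per engine, disallows the sitemaps meant for the others. The engine tokens and file names in the usage example are made up, and build_robots_txt is a name I chose for this illustration.

```python
from urllib.parse import urlsplit


def build_robots_txt(sitemaps_by_engine):
    """Build a robots.txt body from a mapping of user-agent token to sitemap URL.

    Every sitemap is announced, and each engine's section disallows the
    paths of the sitemaps that belong to the other engines.
    """
    lines = []
    for engine in sitemaps_by_engine:
        lines.append("User-agent: " + engine)
        for other, url in sitemaps_by_engine.items():
            if other != engine:
                lines.append("Disallow: " + urlsplit(url).path)
        lines.append("")  # blank line between sections
    for url in sitemaps_by_engine.values():
        lines.append("Sitemap: " + url)
    return "\n".join(lines) + "\n"
```

Called with {"Googlebot": "http://www.example.com/sitemap-google.xml", "Slurp": "http://www.example.com/sitemap-yahoo.xml"}, the Googlebot section disallows /sitemap-yahoo.xml and vice versa, while both Sitemap: lines appear at the end of the file.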

Tags: Search Engine Optimization (SEO) Google Sitemaps Yahoo MSN Ask




Better don’t run a web server under Windows

IIS defaults can cause serious trouble with search engines. That’s a common problem, and not even all .nhs.uk (UK Government National Health Service) admins have spotted it. I’ve alerted the Whipps Cross University Hospital but can’t email all NHS sites suffering from IIS and lazy or uninformed webmasters. So here’s the fix:

Create a server for the name without the subdomain (domain.nhs.uk), then go to the “Home Directory” tab and select the option “Redirection to a URL”. As “Redirect to” enter the destination, for example “http://www.domain.nhs.uk$S$Q”, without a slash after “.uk” because the path ($S placeholder) begins with a slash. The $Q placeholder represents the query string. Next check “Exact URL entered above” and “Permanent redirection for this resource”, and submit. Test the redirection with a suitable tool.
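
To see what the two placeholders do, here is a small Python model of the expansion. It is not IIS code, just an illustration of the behavior described above, assuming that $S stands for the requested path and that $Q carries the query string including its leading question mark. The function name expand_redirect is made up.

```python
from urllib.parse import urlsplit


def expand_redirect(requested_url, template="http://www.domain.nhs.uk$S$Q"):
    """Model how a '$S$Q' redirect template maps a request to its target."""
    parts = urlsplit(requested_url)
    suffix = parts.path or "/"                         # $S: path, begins with a slash
    query = "?" + parts.query if parts.query else ""   # $Q: query string (assumed with "?")
    return template.replace("$S", suffix).replace("$Q", query)
```

A request for http://domain.nhs.uk/services/index.asp?id=7 would thus end up at http://www.domain.nhs.uk/services/index.asp?id=7, which is what a redirect checker should report along with a 301 status.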

Now when a user enters a URL without the “www” prefix s/he gets the requested page from the canonical server name. Also search engine crawlers following non-canonical links like http://whippsx.nhs.uk/ will transmit the link love to the desired URL, and will index more pages instead of deleting them from their search indexes after a while because the server is not reachable. I’m not joking. Under some circumstances all or many www-URLs of pages referenced by relative links resolving to the non-existent server will get deleted from the search index after a couple of unsuccessful attempts to fetch them without the www-prefix.

Hat tip to Robbo
Tags: Search Engine Optimization (SEO) IIS National Health Service (NHS) UK




Yahoo Pipes jeopardizes the integrity of the Internet

Update: This post, initially titled “No more nofollow-insane at Google Reader”, then updated as “(No) more nofollow-insane at Google Reader”, accused Google Reader of inserting nofollow crap. I apologize for my lazy and faulty bug report. Read the comments.

I fell in love with Yahoo pipes because that tool allowed me to funnel the tidbits contained in a shitload of noise into a more or less clear signal. Instead of checking hundreds of blog feeds, search query feeds and whatever else, I was able to feed my preferred reader with actual payload extracted from vast loads of paydirt dug from lots of sources.

Now that I’ve learned that Yahoo pipes is evil I guess I must code the filters myself. Nofollow insanity is not acceptable. Nofollow madness jeopardizes the integrity of the Internet, which is based on free linkage. I don’t need no stinking link condoms sneakily forced by nice looking tools utilizing nifty round corners. I’ll be way happier with a crappy and uncomfortable PHP hack fed with OPML files and conditions pulled from a manually edited MySQL table.

Here is the evidence right from the Yahoo pipe output:
Also, abusing my links with target=”_blank” is not nice.


Initial post and its first update:

I’m glad Google has removed the auto-nofollow on links in blog posts. When I add a feed I trust its linkage and I don’t need no stinking condoms on pages nobody except me can see unless I share them. Thanks!

Update - Nick Baum, can you hear me?

It seems the nofollow-madness is not yet completely buried. Here is a post of mine and what Google Reader shows me when I add my blog’s feed:
[Screenshot: the post as shown in Google Reader]
And here is the same post filtered thru a Yahoo pipe:
[Screenshot: the same post filtered through a Yahoo pipe]
So please tell me: why does Google auto-nofollow a link to Vanessa Fox when she gets linked via Yahoo, and uncondomize the link from Google’s very own blogspot dot com? Curious …




5 Reasons why I blog

So since Matt Cutts, tagged by Vanessa Fox, cat-tagged me 5 times ;) I’ll add my piece.

    Napping cats don't listen
  1. Well, I’ve started this blog because every dog and his grandpa blogs, but the actual reason was that I couldn’t convince my beloved old cat to listen to my rants any more. Sadly my old comrade died years ago at the age of 15, leaving a gang of two-legged monsters to rampage alone in house and garden.
  2. Since then I’ve used my blogs for kidding, bollocks, and other stuff not suitable for more or less static sites where I publish more seriously. However, I’ve scraped some wholehearted posts from the blog to put them on the consulting platform, because this site is way more popular. Vice versa, I’ve announced my other articles and projects here. This blog is somewhat of a playground to test the waters and concurrently a speaking tube. I still find it difficult to do that with another platform; the timely character of blogging perfectly allows burying of half-baked things.
  3. Every now and then I write an open letter to Google, for example my series of pleas to revamp rel=nofollow. Perhaps a googler is listening ;)
    Also, a blog is a neat instrument to get the attention of folks who don’t seem to listen.
  4. Frankly I like to share ideas and knowledge. Blogging is the perfect platform to raise rumors or myths too. Also, writing helps me to structure my thoughts, and this works even better in a foreign language.
  5. Last but not least I use my blog as reference. While providing Google user support sometimes I just drop a link, particularly as an answer to repetitive questions. By the way Google’s Webmaster Forum is a nice place to chase SEO tidbits straight from the horse’s mouth.

Although I admit I’ve somewhat tag-baited my way in here, I’m tagging you:
Thu Tu
John Müller
John Honeck
Jim Boykin
Gurtie & Chris




Four reasons to get tanked on Google’s SERPs

You know I find “My Google traffic dropped all of a sudden - I didn’t change anything - I did nothing wrong” threads fascinating. Especially posted with URLs on non-widgetized boards. Sometimes I submit an opinion, although the questioners usually don’t like my replies, but more often I just investigate the case for educational purposes.

Abstracting from a fair number of tanked sites, I’d like to share a few of my head notes or theses as food for thought. I’ve tried to put these as generally as possible, so please don’t blame me for the lack of a detailed explanation.

  1. Reviews and descriptions ordered by product category, product line, or other groupings of similar products tend to rephrase each other semantically, that is in form and content. Be careful when it comes to money making areas like travel or real estate. Stress unique selling points, non-shared attributes or utilizations, localize properly and make sure reviews and descriptions don’t get spread in full internally on crawlable pages as well as externally.
  2. Huge clusters of property/characteristic/feature lists under analogous headings, even unstructured, may raise a flag when the number of applicable attributes is finite and values are rather similar, with just a few of them totally different or mere expressions of trite localization.
  3. The lack of non-commercial outgoing links on pages plastered with ads of any kind, or pages at the very bottom of the internal linking hierarchy, may raise a flag. Nofollow’ing, redirecting or iFraming affiliate/commercial links doesn’t prevent artificial page profiles. Adding unrelated internal links to the navigation doesn’t help. Adding Wikipedia links in masses doesn’t help. Providing unique textual content and linking to authorities within the content does help.
  4. Strong and steep hierarchical internal/navigational linkage without relevant crosslinks and topical between-the-levels linkage looks artificial, especially when the site in question lacks deep links. Look at the ratio of home page links vs. deep links to interior pages. Rethink the information architecture and structuring.
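
The ratio mentioned in the last point is simple arithmetic once you have a list of inbound link targets, for example exported from a backlink report. The Python sketch below is just that arithmetic. I don’t know of any threshold Google applies, so the function (with its made-up name deep_link_ratio) only gives a number to compare across sites.

```python
from urllib.parse import urlsplit


def deep_link_ratio(link_targets):
    """Return the share of inbound link targets pointing below the home page."""
    if not link_targets:
        return 0.0
    # A target counts as a home page link when its path is empty or "/".
    deep = sum(1 for url in link_targets if urlsplit(url).path not in ("", "/"))
    return deep / len(link_targets)
```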

Take that as a call for antitheses or just food for thought. And keep in mind that although there might be no recent structural/major/SEO/… on-site changes, perhaps Google just changed her judgement on the ancient stuff ranking forever, and/or has just changed the ability of existing inbound links to pass weight. Nothing’s set in stone. Not even rankings.

Tags: Search Engine Optimization (SEO) Google 30+ Penalty 950+ Penalty MSSA Penalties




Link monkey business is not worth a whoop

Old news, pros move on educating the great unlinked.

A tremendous number of businesses maintaining a Web site still swap links in masses with every dog and his fleas. Serious sites join link exchange scams to gain links from every gambling spammer out there. Unscrupulous Web designers and low-life advisors put gazillions of businesses at risk. Eventually the site owners pop up in Google’s help forum wondering why the heck they lost their rankings despite their emboldening toolbar PageRank. Told to dump all their links pages and to file a reinclusion request they may do so, but cutting one’s loss short term is not the way the cookie crumbles with Google. Consequences of listening to bad SEO advice are often layoffs or even going bust.

In this context a thread titled “Do the companies need to hire a SEO to get in top position?” asks the somewhat right question but may irritate site owners even more. Their amateurish Web designer offering SEO services obviously got their site banned or at least heavily penalized by Google. Asking for help in forums they get contradictory SEO advice. Google’s take on SEO firms is more or less a plain warning. Too many scams sailing under the SEO flag and it seems there’s no such thing as reliable SEO advice for free on the net.

However, the answer to the question is truly “yes”. It’s better to see a SEO before the rankings crash out. Unfortunately, SEO is not a yellow pages category, and every clown can offer crappy SEO services. Places like SEO Consultants and honest recommendations get you the top notch SEOs, but usually the small business owner can’t afford their services. Asking fellow online businesses for their SEO partner may lead to a scammer who is still beloved because Google has not yet spotted and delisted his work. Kinda dilemma, huh?

Possible loophole: once you’ve got a recommendation for a SEO skilled Webmaster or SEO expert from somebody attending a meeting at the local chamber of commerce, post that fellow’s site to the forums and ask for signs of destructive SEO work. Should give you an indication of trustworthiness.

Tags: Search Engine Optimization (SEO) Google Linkage Bans & Penalties




Supplemental-Only

Nice closing words on “my stuff went supplemental” from JohnWeb.

Applying simplified conclusions to a complex SEO question reveals 20% of the truth, whilst 80% is just not worth discussing because the effort necessary to analyze one more percent equals that of the 20% analysis. The alternative is working with 20% reasonable conclusions plus 80% common sense.

Unfortunately, common sense is not as common as you might think. Just count the supplemental-threads across the board, then search for words of wisdom. Sigh.

Tags: Search Engine Optimization (SEO) Google supplemental index
Update: Read Matt’s Google Hell



