Archived posts from the 'Recommendations' Category

Dear webmaster, don’t trust Google on links

I’m not exactly a fan of non-crawlable content, but here it goes, provided by Joshua Titsworth (click on a tweet in order to favorite and retweet it):

What has to be said

Google’s link policy is apeshit. They don’t even believe their own FUD. So don’t bother listening to crappy advice anymore. Just link out without fear and encourage others to link to you fearlessly. Do not file reinclusion requests the moment you hear about a Googlebug on your preferred webmaster hangout, just because you might have spread a few shady links by accident, and don’t slaughter links you’ve created coz the almighty Google told you so.

Just because a bazillion flies can’t err, that doesn’t mean you’ve to eat shit.




About time: EU crumbles monster cookies from hell

Some extremely bright bravehearts at the European Union headquarters in Bruxelles finally took the initiative and passed a law to fight the Interweb’s gremlins, which play down their sneaky and destructive behavior behind an innocent as well as inapplicable term: COOKIE.

Back in the good old days when every dog and its fleas used an Internet Explorer to consume free porn, and to read unbalanced left-leaning news published by dubious online tabloids based in communist strongholds, only the Vatican spread free cookies to its occasional visitors. Today, every site you visit makes use of toxic cookies, thanks to Google Web Analytics, Facebook, Amazon, eBay and countless smut peddlers.

Not that stone age browsers like IE6 could handle neat 3rd party cookies (which today’s advertising networks use to shove targeted product news down your throat) without a little help from a 1×1 pixel iFrame, but that’s a completely different story. The point I want to bring home is: cookies never were harmless at all!

Quite the opposite is true. As a matter of fact, Internet cookies pose as digestible candies, but once swallowed they turn into greedy germs that produce torturous flatulence, charge your credit card with overpriced Rolex® replicas and other stuff you really don’t need, and spam all your email accounts until you actually need your daily dose of Viagra® to handle all the big boobs and stiff enlarged dicks delivered to your inbox.

Now that you’re well informed on the increasing cookie pest, a.k.a. cookie pandemic, I’m dead sure you’ll applaud the EU anti-cookie law that’s going to get enforced by the end of May 2012, world-wide. Common sense and life experience tell us that indeed local laws can tame the Wild Wild West (WWW).

Well, at least in the UK, so far. That’s quite astonishing by the way, because usually the UK vetoes or boycotts everything EU, until their lowbrow-thinking and underpaid lawyers discover that previous governments have already signed some long-forgotten contracts defining EU regulations as binding for tiny North Sea islands, even if they’re located somewhere in the Atlantic Ocean and consider themselves huge.

Anyway, although literally nobody except a few Web savvy UK webmasters (not even its creators, who can’t find their asshole with both hands fumbling in the dark) knows what the fuck this outlandish law is all about, we need to comply. For the sake of our unborn children, civic duty, or whatever.

Of course you can’t be bothered with researching all this complex stuff. Unfortunately, I can’t link to authoritative sources, because not even the almighty Google told me how alien webmasters can implement a diffuse EU policy that hasn’t made it into the code of law of any EU member state yet (except for the above-mentioned remote islands, though even those have no fucking clue with regard to reasonable procedures and such). That makes this red crab the authoritative source on everything ‘EU cookie law’. Sigh.

So here comes the ultimate guide for this planet’s webmasters who’d like to do business with EU countries (or suffer from an EU citizenship).

Step 1: Obfuscate your cookies

In order to make your most sneaky cookies undetectable, flood your visitor’s computer with a shitload of randomly generated and totally meaningless cookies. Make sure that everything important for advertising, shopping cart, user experience and so on gets set first, because the 1024th and all following cookies face the risk of getting ignored by the user agent.

Do not use meaningful variable names for cookies, and encode all values. That is, instead of setting added_not_exactly_willingly_purchased_items_to_shopping_cart[13] = golden_humvee_with_diamond_break_pads just create an unobtrusive cookie like additional_discount_upc_666[13] = round(99.99, 0) + '%'.
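
For the hands-on crowd, here’s what step 1 might look like in practice. This is a minimal client-side sketch of my own; every name and number in it is invented for illustration:

// Bury the one cookie you care about under a pile of random noise.
function obfuscateCookies(importantName: string, importantValue: string): void {
  // Set the important stuff first: user agents may start dropping
  // cookies once you approach their per-host limit.
  document.cookie = importantName + '=' + encodeURIComponent(importantValue) + '; path=/';
  // Now the shitload of randomly generated, totally meaningless cookies.
  for (let i = 0; i < 300; i++) {
    const name = 'c' + Math.random().toString(36).slice(2, 10);
    const value = Math.random().toString(36).slice(2);
    document.cookie = name + '=' + value + '; path=/';
  }
}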

Step 2: Ask your visitors for permission to accept your cookies

When a new visitor hits your site, create a hidden popunder window with a Web form like this one:


  • Of course
  • Why not
  • Yes, and don’t ask me again
  • Yup, get me to the free porn asap
  • I’ve read the TOS and I absolutely agree

Don’t forget to test the auto-submit functionality with all user agents (browsers) out there. Also, log the visitor’s IP addy, browser version and such stuff. Just in case you need to present it in a lawsuit later on.

Step 3: Be totally honest and explain every cookie to your visitors

Somewhere on a deeply buried TOS page linked from your privacy policy page that’s no-followed across your site with an anchor text formatted in 0.001pt, create an ugly table like this one:

My awesome Web site’s wonderful cookies:

  • _preg=true: This cookie makes you pregnant. Also, it creates an order for 100 diapers, XXS, assorted pink and blue, to be delivered in 9 months. Your PayPal account (taken from a befriended Yahoo cookie) gets charged today.
  • _vote_rig=conditional: If you’ve participated in a poll and your vote doesn’t match my current mood, I’ll email your mother-in-law that you’re cheating on your spouse. Also, regardless what awkward vote you’ve submitted, I’ll change it in a way that’s compatible with my opinion on the topic in question.
  • _auto_start=daily: Adds my product-of-the-day page to your auto-start group. Since I’ve collected your credit card details already, I’m nice enough to automate the purchase process in an invisible browser window that closes after I’ve charged your credit card. If you dare to not reboot your pathetic computer at least once a day, I’ll force an hourly reboot in order to teach you how the cookie crumbles.
  • _joke=send: If you see this cookie, I found a .pst file on your computer. All your contacts will enjoy links to questionable (that is NotSafeAtWork) jokes delivered by email from your account, often.
  • _boobs=show: If you’re a male adult, you’ve just subscribed to my ‘weird boob job’ paysite.
  • _dicks=show: That’s the female version of the _boobs cookie. Also delivered to my gay readers, just the landing page differs a little bit.
  • _google=provided: You were thoughtless enough to surf my blog while logged into your Google account. You know, Google just stole my HTTP_REFERER data, so in revenge I took over your account in order to gather the personal and very private information the privacy nazis at Google don’t deliver for free any more.
  • _twitter=approved: Just in case you check out your Twitter settings by accident, do not go to the ‘Apps’ page and do not revoke my permissions. The few DMs I’ve sent to all your followers only feed my little very hungry monsters, so please leave my tiny spam operation alone.
  • _fb=new: Heh. You zucker (pronounced sucker) lack a Facebook account. I’ve stepped in and assigned it to my various interests. Don’t you dare to join Facebook manually, I do own your name!
  • _443=nope: Removes the obsolete ‘s’ (SSL) from URIs in your browser’s address bar. That’s a prerequisite for my free services, like maintaining a backup of your Web mail as user generated content (UGC) in my x-rated movie blog’s comment area. Don’t whine, it’s only visible to search engine crawlers, so your dirty little secrets are totally safe. Also, I don’t publish emails containing Web site credentials, bank account details and such, because sharing those with my fellow blackhat webmasters would be plain silly.
  • eol=granted: Your right to exist has expired, coz your bank account’s balance doesn’t allow any further abuse. This status is also known as ‘end of life’. Say thanks to the cookie community and get yourself a tombstone as long as you (respectively your clan, coz you went belly up by now) can afford it.

Because I’m somewhat lazy, the list above isn’t made up but an excerpt of my blog’s actual cookies.

As a side note, don’t forget to collect local VAT (different percentages per EU country, depending on the type of goods you don’t plan to deliver across the pond) from your EU customers, and do pay the taxman. If you’ve trouble finding the taxman in charge, ask your offshore bank for assistance.

Have fun maintaining a Web site that totally complies with international laws. And thanks for your time (which you’d better have invested in developing a Web site that doesn’t rely on cookies for a great user experience).

Summary: The stupid EU cookie law in 2.5 minutes:

If you still don’t grasp how an Internet cookie really tastes, here is the explanation for the geeky preschooler: RFC 2109.

By the way, this comprehensive tutorial might make you believe that only the UK has implemented the EU cookie law yet. Of course the Brits wouldn’t have the balls to perform such a risky solo stunt without being accompanied by two tiny countries bordering the Baltic Sea: Denmark and Estonia (don’t even try to find European ministates and islands on your globe without a precision magnifier). As soon as the Internet comes to these piddly shorelines, I’ll report on their progress (frankly, don’t really expect an update anytime soon).




Geo targeting without IP delivery is like throwing a perfectly grilled steak at a vegan

So Gareth James asked me to blather about the role of IP delivery in geo targeting. I answered “That’s a complex topic with gazillions of ‘depends’ lacking the potential of getting handled with a panacea”, and thought he’d just bugger off before I’ve to write a book published on his pathetic UK SEO blog. Unfortunately, it didn’t work according to plan A. This @seo_doctor dude is as persistent as a blowfly attacking a huge horse dump. He dared to reply “lol thats why I asked you!”. OMFG! Usually I throw insults at folks starting a sentence with “lol”, and I don’t communicate with native speakers who stingily shorten “that’s” to “thats” and don’t capitalize any letter except “I” for egomaniac purposes.

However, I haven’t annoyed the Interwebz with a pamphlet for (perceived) ages, and the topic doesn’t exactly lack controversial discussion, so read on. By the way, Gareth James is a decent guy. I’m just not fair, making fun of his interesting question for the sake of a somewhat funny opening. (That’s why you’ve read this pamphlet on his SEO blog earlier.)

How to increase your bounce rate and get your site tanked on search engine result pages with IP delivery in geo targeting

A sure-fire way to make me use my browser’s back button is any sort of redirect based on my current latitude and longitude. If you try it, you can measure my blood pressure in comparison to an altitude some light-years above mother earth’s ground. You’ve seriously fucked up my surfing experience, therefore you’re blacklisted back to the stone age, and even a few stones farther just to make sure your shitty Internet outlet can’t make it to my browser’s rendering engine any more. Also, I’ll report your crappy attempt to make me sick of you to all major search engines for deceptive cloaking. Don’t screw red crabs.

Related protip: Treat your visitors with due respect.

Geo targeted ads are annoying enough. When I’m in a Swiss airport’s transit area reading an article on any US news site about the congress’ latest fuck-up in foreign policy, most probably it’s not your best idea to plaster my cell phone’s limited screen real estate with ads recommending Zurich’s hottest brothel that offers a flat rate as low as 500 ‘fränkli’ (SFR) per night. It makes no sense to make me horny minutes before I enter a plane where I can’t smoke for fucking eight+ hours!

Then if you’re the popular search engine that in its almighty wisdom decides that I’ve to seek a reservation Web form of Boston’s best whorehouse for 10am local time (that’s ETA Logan + 2 hours) via google.ch in French, you’re totally screwed. In other words: because it’s not Google, I go search for it at Bing. (The “goto Google.com” thingy is not exactly reliable, and a totally obsolete detour when I come by with a google.com cookie.)

The same goes for a popular shopping site that redirects me to its Swiss outlet based on my location, although I want to order a book to be delivered to the United States. I’ll place my order elsewhere.

Got it? It’s perfectly fine with me to ask “Do you want to visit our Swiss site? Click here for its version in French, German, Italian or English language”. Just do not force me to view crap I can’t read and didn’t expect to see when I clicked a link!
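
For illustration, a notice as simple as the following would do. This is a hypothetical markup sketch; all URLs in it are invented:

<!-- Hypothetical notice for visitors geo-located in Switzerland.
     The visitor chooses; nobody gets redirected behind their back. -->
<div class="geo-notice">
  <p>Do you want to visit our Swiss site?</p>
  <a href="http://example.ch/fr/">Français</a> |
  <a href="http://example.ch/de/">Deutsch</a> |
  <a href="http://example.ch/it/">Italiano</a> |
  <a href="http://example.ch/en/">English</a> |
  <a href="#" class="geo-notice-dismiss">No thanks, stay here</a>
</div>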

Regardless whether you redirect me server-sided using a questionable ip2location lookup, or client-sided evaluating the location I carelessly opened up to your HTML5-based code, you’re doomed coz I’m pissed. (Regardless whether you do that under one URI, respectively the same URI with different hashbang crap, or a chain of actual redirects.) I’ve just increased your bounce rate at lightning speed, and trust me, it’s not just yours truly alone who tells click tracking search engines that your site is scum.

How to fuck up your geo targeting with IP delivery, SEO-wise

Of course there’s no bullet proof way to obtain a visitor’s actual location based on the HTTP request’s IP address. Also, if the visitor is a search engine crawler, it requests your stuff from Mountain View, Redmond, or an undisclosed location in China, Russia, or some dubious banana republic. I bet that as a US based Internet marketer offering local services across all states you can’t serve a meaningful ad targeting Berlin, Paris, Moscow or Canton. Not that Ms Googlebot appreciates cloaked content tailored for folks residing at 1600 Amphitheatre Parkway, by the way.

There’s nothing wrong with delivering a cialis™ or viagra® peddler’s sales pitch to search engine users from a throwaway domain that appeared on a [how to enhance my sexual performance] SERP for undisclosable reasons, but you really shouldn’t do that (or something similar) from your bread and butter site.

When you’ve content in different languages and/or you’re targeting different countries, regions, or whatever, you shall link that content together by language and geographical targets, providing prominent but not obfuscating links to other areas of your site (or local domains) for visitors who –indicated by browser language settings, search terms taken from the query string of the referring page, detected (well, guessed) location, or other available signals– might be interested in these versions. Create kinda regional sites within your site which are easy to navigate for the targeted customers. You can and should group those site areas by sitemaps as well as reasonable internal linkage, and use other techniques that distribute link love to each localized version.
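
For instance, each regional site area could get its own sitemap, tied together by a sitemap index. A hedged sketch, all file names invented:

<?xml version="1.0" encoding="UTF-8"?>
<!-- One sitemap per localized site area, grouped in a sitemap index. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-us-en.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-ch-de.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-ch-fr.xml</loc></sitemap>
</sitemapindex>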

Thou shalt not serve more than one version of localized content under one URI! If you can’t resist, you’ll piss off your visitors and ask for trouble with search engines. Most of your stuff will never see the daylight of a SERP, by design.

This golden rule applies to IP delivery as well as to any other method that redirects users without explicit agreement. Don’t rely on cookies and such to determine the user’s preferred region or language; always provide visible alternatives when you serve localized content based on previously collected user decisions.

But …

Of course there are exceptions to this rule. For example, it’s not exactly recommended to provide content featuring freedom of assembly and expression in fascist countries like Iran, Russia or China, and bare boobs as well as Web analytics or Facebook ‘like’ buttons can get you into deep shit in countries like Germany, where last-century nazis make the Internet laws. So sometimes IP delivery is the way to go.




Get IE9 today! Free download - start surfing fast and safe, instantly!

Days before Microsoft is going to release their new-ish Internet Explorer 9 (IE9), you can get your free copy of a state-of-the-art Web browser here:

GET IE9

The Internet says thanks to James Groome from London, UK, for an amazingly short IE9 download URI: GetIE9.com. Of course, you can still download the best and fastest Web browser out there from its original, longish, download URI.

Go get your new Web browser today, to start surfing safe and fast, instantly. Never worry about Web browser updates any more, because your new Web browser updates itself when necessary.

This page is best viewed with Chrome or Safari. You may have spotted that getie9.com doesn’t really lead to something like Internet Explorer 9. #GeekHumor




Validate your robots.txt - Googlebot becomes smarter

Last week I reported that Google experiments with new crawler directives for use in robots.txt. Today Google confirmed that Googlebot understands experimental REP syntax like Noindex:.

That means that forgotten –and, until recently, ignored– statements in your robots.txt might change the crawler’s behavior all of a sudden, without notice. I don’t know for sure which experimental crawler directives Google has implemented, but for example a line like
Noindex: /
in your robots.txt will now deindex your complete Web site.

“Noindex:” is not defined in the Robots Exclusion Protocol from 1994, and not mentioned in Google’s official documents.

John Müller from Google Zürich states:

At the moment we will usually accept the “noindex” directive in the robots.txt, but we are not yet at a point where we are willing to set it into stone and announce full support.

[…] I just want to remind everyone again that this is something that may still change over time. Be careful when playing with things like this.

My understanding of “be careful” is:

  • Create a separate section for Googlebot. Do not rely on directives addressing all Web robots. Especially when you’ve a Googlebot section already, Google’s crawler will ignore directives set under “all user agents” and process only the Googlebot section. Repeat all statements from User-agent: * under User-agent: Googlebot to make sure that Googlebot obeys them (see the sketch after this list).
  • RTFM
  • Do not use other crawler directives than
    Disallow:
    Allow:
    Sitemap:
    in the Googlebot section.
  • Don’t mess up pattern matching.
    * matches a sequence of characters
    $ specifies the end of the URL
    ? separates the path from the query string; you can’t use it as a wildcard!
  • Validate your robots.txt with the cool robots.txt analyzer in your Google Webmaster Console.
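
Put together, a “be careful” robots.txt might look like this sketch (the paths are placeholders, not recommendations):

User-agent: *
Disallow: /private/

User-agent: Googlebot
# Googlebot ignores the section above once this one exists,
# so repeat everything it must obey:
Disallow: /private/
# Pattern matching: * matches a sequence of characters, $ ends the URL.
Disallow: /*?sessionid=
Disallow: /*.pdf$

Sitemap: http://example.com/sitemap.xml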

Folks put the funniest stuff into their robots.txt, for example images or crawl delays like “Don’t crawl this site during our office hours”. Crawler directives from robots meta tags aren’t very popular, but they appear in many robots.txt files. Hence it makes sound sense to use what people express, regardless of the syntax errors.

Also, having the opportunity to manage page specific crawler directives like “noindex”, “nofollow”, “noarchive” and perhaps even “nopreview” on site level is a huge time saver, and eliminates many points of failure. Kudos to Google for this initiative, I hope it will make it into the standards.

I’ll test the experimental robots.txt directives and post the results. Perhaps I’ll set up a live test like this one.

Take care.


Update: Here is the live test of suspected respectively desired new crawler directives for robots.txt. I’ve added a few unusual statements to my robots.txt and uploaded scripts to monitor search engine crawling. The test pages provide links to search queries so you can check whether Google indexed them or not.

Please don’t link to the crawler traps, I’ll update this post with my findings. Of course I appreciate links, so here is the canonical URL:
http://sebastians-pamphlets.com/validate-your-robots-txt-or-google-might-deindex-your-site/#live-robots-txt-test

Please note that you should not make use of the crawler directives below on production systems! Bear in mind that you can achieve all that with simple X-Robots-Tags in the HTTP headers. That’s a bullet-proof way to apply robots meta tags to files without touching them, and it works with virtual URIs too. X-Robots-Tags are sexy, but many site owners can’t handle them for various reasons, whereas corresponding robots.txt syntax would be usable for everybody (not suffering from restrictive and/or free hosts).
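
For example, on Apache with mod_headers you could send robots meta tag values for all PDF files in the HTTP header. My example, not taken from anybody’s documentation:

# Apply "noindex, noarchive" to every PDF file via the HTTP header.
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>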

Noindex:

robots.txt:
Noindex: /repstuff/noindex.php

Expected behavior:
No crawling/indexing. It seems Google interprets “Noindex:” as “Disallow:”.
Desired behavior:
“Follow:” is the REP’s default, hence Google should fetch everything and follow the outgoing links, but shouldn’t deliver Noindex’ed contents on the SERPs, not even as URL-only listings.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noindex.php Blocked by line 30: Noindex: /repstuff/noindex.php
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled (possibly caused by an outdated robots.txt cache).
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noindex.php.
2007-11-23: indexed and cached a page linked only from noindex.php.
(If an outdated robots.txt cache falsely allowed crawling, the search result(s) should disappear shortly after the next crawl.)
2007-11-26: deindexed, the same goes for the linked page (without recrawling).
2007-12-07: appeared under “URLs restricted by robots.txt” in GWC.
2007-12-17: I consider this case closed. Noindex: blocks crawling, deindexes previously indexed pages, and is suspected to block incoming PageRank.

Nofollow:

robots.txt:
Nofollow: /repstuff/nofollow.php

Expected behavior:
Crawling, indexing, and following the links as if there’s no “Nofollow:”.
Desired behavior:
Crawling, indexing, and ignoring outgoing links.
Google’s robots.txt validator:
Line 31: Nofollow: /repstuff/nofollow.php Syntax not understood
http://sebastians-pamphlets.com/repstuff/nofollow.php Allowed
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from nofollow.php (21 Nov 2007 23:19:37 GMT, for some reason not logged properly).
2007-11-23: indexed and cached a page linked only from nofollow.php.
2007-11-26: recrawled, deindexed, no longer cached. The same goes for the linked page.
2007-11-28: cached again, the timestamp on the cached copy “27 Nov 2007 01:11:12 GMT” doesn’t match the last crawl on “2007-11-26 16:47:11 EST” (EST = GMT-5).
2007-12-07: recrawled, still deindexed, cached. Linked page recrawled, cached.
2007-12-17: recrawled, still deindexed (probably caused by near duplicate content on noarchive.php and other pages involved in this test), cached copy dated 2007-12-07. Cache of the linked page still dated 2007-11-21. I consider this case closed. Nofollow: doesn’t work as expected, Google doesn’t support this statement.

Noarchive:

robots.txt:
Noarchive: /repstuff/noarchive.php

Expected behavior:
Crawling, indexing, following links, but no “Cached” links on the SERPs and no access to cached copies from the toolbar.
Desired behavior:
Crawling, indexing, following links, but no “Cached” links on the SERPs and no access to cached copies from the toolbar.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noarchive.php Allowed
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noarchive.php.
2007-11-23: indexed and cached a page linked only from noarchive.php.
2007-11-26: recrawled, deindexed, no longer cached. The linked page was deindexed without recrawling.
2007-11-28: cached again, the timestamp on the cached copy “27 Nov 2007 01:11:19 GMT” doesn’t match the last crawl on “2007-11-26 16:47:18 EST” (EST = GMT-5).
2007-11-29: recrawled, cache not yet updated.
2007-12-07: recrawled. Linked page recrawled.
2007-12-08: recrawled.
2007-12-11: recrawled the linked page, which is cached but not indexed.
2007-12-12: recrawled.
2007-12-17: still indexed, cached copy dated 2007-12-08. I consider this case closed. Noarchive: doesn’t work as expected, actually it does nothing although according to the robots.txt validator that’s supported –or at least known and accepted– syntax.

(It looks like Google understands Nosnippet: too, but I didn’t test that.)

Nopreview:

robots.txt:
Nopreview: /repstuff/nopreview.pdf

Expected behavior:
None, unfortunately.
Desired behavior:
No “view as HTML” links on the SERPs. Neither “nosnippet” nor “noarchive” suppress these helpful preview links, which can be pretty annoying in some cases. See NOPREVIEW: The missing X-Robots-Tag.
Google’s robots.txt validator:
Line 33: Nopreview: /repstuff/nopreview.pdf Syntax not understood
http://sebastians-pamphlets.com/repstuff/nopreview.pdf Allowed
Status:
Crawler requests of nopreview.pdf are logged here.
Google’s crawler / indexer:
2007-11-21: crawled the nopreview-pdf and the log page nopreview.php.
2007-11-23: indexed and cached the log file nopreview.php.
[2007-11-23: I replaced the PDF document with a version carrying a hidden link to an HTML file, and resubmitted it via Google’s add-url page and a sitemap.]
2007-11-26: The old version of the PDF is cached as a “view-as-HTML” version without links (considering the PDF was a captured print job, that’s a pretty decent result), and appears on SERPs for a quoted search. The page linked from the PDF and the new PDF document were not yet crawled.
2007-12-02: PDF recrawled. Googlebot followed the hidden link in the PDF and crawled the linked page.
2007-12-03: “View as HTML” preview not yet updated, the linked page not yet indexed.
2007-12-04: PDF recrawled. The preview link reflects the content crawled on 12/02/2007. The page linked from the PDF is not yet indexed.
2007-12-07: PDF recrawled. Linked page recrawled.
2007-12-09: PDF recrawled.
2007-12-10: recrawled linked page.
2007-12-14: PDF recrawled. Cached copy of the linked page dated 2007-12-11.
2007-12-17: I consider this case closed. Neither Nopreview: nor Noarchive: (in robots.txt since 2007-12-04) are suitable to suppress the HTML preview of PDF files.

Noindex: Nofollow:

robots.txt:
Noindex: /repstuff/noindex-nofollow.php
Nofollow: /repstuff/noindex-nofollow.php

Expected behavior:
No crawling/indexing, invisible on SERPs.
Desired behavior:
No crawling/indexing, and no URL-only listings, ODP titles/descriptions and stuff like that on the SERPs. “Noindex:” in combination with “Nofollow:” is a paraphrased “Disallow:”.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noindex-nofollow.php Blocked by line 35: Noindex: /repstuff/noindex-nofollow.php
Line 36: Nofollow: /repstuff/noindex-nofollow.php Syntax not understood
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noindex-nofollow.php.
2007-11-23: indexed and cached a page linked only from noindex-nofollow.php.
2007-11-26: deindexed without recrawling, the same goes for the linked page.
2007-11-29: the cached copy retrieved on 11/21 reappeared.
2007-12-08: appeared under “URL restricted by robots.txt” in my GWC acct.
2007-12-17: Case closed, see Noindex: above.

Noindex: Follow:

robots.txt:
Noindex: /repstuff/noindex-follow.php
Follow: /repstuff/noindex-follow.php

Expected behavior:
No crawling/indexing, hence unfollowed links.
Desired behavior:
Crawling, following and indexing outgoing links, but no SERP listings.
Google’s robots.txt validator:
http://sebastians-pamphlets.com/repstuff/noindex-follow.php Blocked by line 38: Noindex: /repstuff/noindex-follow.php
Line 39: Follow: /repstuff/noindex-follow.php Syntax not understood
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from noindex-follow.php.
2007-11-23: indexed and cached a page linked only from noindex-follow.php.
2007-11-26: deindexed without recrawling, the same goes for the linked page.
2007-12-08: appeared under “URL restricted by robots.txt” in my GWC acct.
2007-12-17: Case closed, see Noindex: above. Google didn’t crawl, respectively deindexed, despite the Follow: directive.

Index: Nofollow:

robots.txt:
Index: /repstuff/index-nofollow.php
Nofollow: /repstuff/index-nofollow.php

Expected behavior:
Crawling/indexing, following links.
Desired behavior:
Crawling/indexing but ignoring outgoing links.
Google’s robots.txt validator:
Line 41: Index: /repstuff/index-nofollow.php Syntax not understood
Line 42: Nofollow: /repstuff/index-nofollow.php Syntax not understood
http://sebastians-pamphlets.com/repstuff/index-nofollow.php Allowed
Status:
See test page
Google’s crawler / indexer:
2007-11-21: crawled.
2007-11-23: indexed and cached.
2007-11-21: crawled a page linked only from index-nofollow.php.
2007-11-23: indexed and cached a page linked only from index-nofollow.php.
2007-11-26: recrawled and deindexed. The linked page was deindexed without recrawling.
2007-11-28: cached again, the timestamp on the cached copy “27 Nov 2007 01:11:26 GMT” doesn’t match the last crawl on “2007-11-26 16:47:25 EST” (EST = GMT-5).
2007-12-02: recrawled, the cached copy has vanished.
2007-12-07: recrawled. Linked page recrawled.
2007-12-08: recrawled.
2007-12-09: recrawled.
2007-12-10: recrawled.
2007-12-17: cached under 2007-12-10, not indexed. Linked page not cached, not indexed. I consider this case closed. Google currently supports neither Index: nor Nofollow:.

(I didn’t test Noodp: and Unavailable_after [RFC 850 formatted timestamp]:, although both directives would make sense in robots.txt too.)

2007-11-20:
Added the experimental statements to robots.txt.

2007-11-21:
Linked the test pages. Google crawled all of them, including the pages submitted via links on test pages.

2007-11-23:
Most (all but the PDF document) URLs appear on search result pages. If an outdated robots.txt cache falsely allowed crawling although the WC-validator said “Blocked”, the search results should disappear shortly after the next crawl. I’ve created a sitemap for all URLs above and submitted it. Although I’ve –for the sake of this experiment– cloaked text as well as links and put white links on white background, luckily there is no “we caught you black hat spammer” message in my Webmaster Console. Googlebot nicely followed the cloaked links and indexed everything.

2007-11-26:
Google recrawled a few pages (noarchive.php, index-nofollow.php and nofollow.php), then deindexed all of them. Only the PDF document is indexed, and Google created a “view-as-HTML” preview from this captured print job. It seems that Google crawled something from another host than “*.googlebot.com”, unfortunately I didn’t log all requests. Probably the deindexing was done by a sneaky bot discovering the simple cloaking. Since the linked URLs are out and 3rd party links to them can’t ruin the experiment any longer, I’ve stopped cloaking and show the same text/links to bots and users (actually, users see one more link but that should be fine with Google). There’s still no “thou shalt not cloak” message in my GWC account. Well, those pages are fairly new, perhaps not fully settled in the search index, so let’s see what happens next.

2007-11-28
The PDF file as well as the three pages recrawled on 11/26/2007 21:45:00 GMT were reindexed, but the timestamp on the cached copies says “retrieved on 27 Nov 2007 01:15:00 GMT”. Maybe the date/time displayed on cached page copies doesn’t reflect Ms. Googlebot’s “fetched” timestamp, but the time the indexer pulled the page out of the centralized crawl results cache 3.5 hours after crawling.

It seems the “Noarchive:” directive doesn’t work, because noarchive.php was crawled and indexed twice providing a cached page copy. My “Nopreview:” creation isn’t supported either, but maybe Dan Crow’s team picks it up for a future update of their neat X-Robots-Tags (I hope so).

The noindex’ed pages (noindex.php, noindex-nofollow.php and noindex-follow.php) weren’t recrawled and remain deindexed. Interestingly, they don’t appear under “URLs blocked by robots.txt” in my GWC account. Provided the first crawling and indexing on 11/21/2007 was a “mistake” caused by a way too long cached robots.txt, and the second crawl on 11/26/2007 obeyed the “Noindex:” but ignored the (implicit) “Follow:”, it seems that indeed Google interprets “Noindex:” in robots.txt as “Disallow:”. If that is so and if it’s there to stay, they’re going to totally mess up the REP.

<rant> I mean, promoting a rel-nofollow microformat that –at least at launch time– didn’t share its semantics with the REP’s meta tags nor the –later introduced– X-Robots-Tags was evil enough. Ok, meanwhile they’ve corrected this conspiracy flaw by altering the rel-nofollow semantics step by step, until “nofollow” in the REL attribute actually means nofollow and no longer “pass no reputation”, at least at Google. Other engines still handle rel-nofollow according to the initial and officially still binding standard, and a gazillion Webmasters are confused as hell. In other words, only a few search geeks understand what rel-nofollow is all about, but Google jauntily penalizes the great unwashed for not complying with the incomprehensible. By the way, that’s why I code rel="nofollow crap". Standards should be clear and unambiguous. </rant>

If Google really were to introduce a “Noindex:” directive in robots.txt that equals “Disallow:”, that would be totally evil. A few sites out there might have an erroneous “Noindex:” statement in their robots.txt that could mean “Disallow:”, and it’s nice that Google tries to do them a favor. Screwing the REP for the sole purpose of complying with syntax errors, on the other hand, makes no sense. “Noindex” means: crawl it, follow its links, but don’t index it. Semantically “Noindex: Nofollow:” equals “Disallow:”, but a “Noindex:” alone implies a “Follow:”, hence crawling is not only allowed but required.

I really hope that we watch an experiment in its early stage, and that Google will do the right thing eventually. Allowing the REP’s page specific crawler directives in robots.txt is a fucking brilliant move, because technically challenged publishers can’t handle the HTTP header’s X-Robots-Tag, and applying those directives to groups of URIs is a great method to steer crawling and indexing not only with static sites.

Dear Google engineers, please consider the nopreview directive too, and implement (no)index, (no)follow, noarchive, nosnippet, noodp/noydir and unavailable_after with the REP’s meaning. And while you’re at it, I want block level instructions in robots.txt too. For example
Area: /products/ DIV.hMenu,TD#bNav,SPAN.inherited "noindex,nofollow"

could instruct crawlers to ignore duplicated properties in product descriptions and the horizontal menu as well as the navigation elements in a table cell with the DOM-ID “bNav” at the very bottom of all pages in /products/,
Area: / A.advertising REL="nofollow"

could condomize all links with the class name “advertising”, and so on.

2007-11-29
The pages linked from the test pages still don’t come up in search results, noarchive.php was recrawled and remains cached, the cached copy of noindex-nofollow.php retrieved on 11/21/2007 reappeared (probably a DC roller coaster issue).

2007-11-30
Three URLs remain indexed: nopreview.pdf, noarchive.php and noindex-nofollow.php. The cached copies show the content crawled on Nov/21/2007. Everything else is deindexed. That’s not here to stay (index roller coaster).
As a side note: the URL from my first noindex-robots.txt test appeared in my GWC account under “URLs restricted by robots.txt (Nov/27/2007)”, three days after the unsuccessful crawl.

2007-12-02
A few pages were recrawled, Googlebot followed the hidden link in the PDF file.

2007-12-03
In my GWC crawl stats noindex-nofollow.php appeared under “URLs restricted by robots.txt”, but it’s still indexed.

2007-12-04
The preview (cache) of nopreview.pdf was updated. Since obviously Nopreview: doesn’t work, I’ve added
Noarchive: /repstuff/nopreview.pdf

to my robots.txt. Let’s see whether Google removes the cache respectively the HTML preview or not.

2007-12-06
Shortly after the change in robots.txt (Noarchive: /repstuff/nopreview.pdf) Googlebot recrawled the PDF file on 12/04/2007. Today it’s still cached, the HTML preview is still available and linked from SERPs.

2007-12-07
Googlebot has recrawled a few pages. Everything except noarchive.php and nopreview.pdf is deindexed.

2007-12-17
I consider the test closed, but I’ll keep the test pages up so that you can monitor crawling and indexing yourself. Noindex: is the only directive that somewhat works, but it’s implemented completely wrong and is not acceptable in its current shape.

Interestingly the sitemaps report in my GWC account says that 9 pages from 9 submitted URLs were indexed. Obviously “indexed” means something like “crawled at least once, perhaps indexed, maybe not, so if you want to know that definitively then get your lazy butt to check the SERPs yourself”. How expensive would it be to tell something like “Total URLs in sitemap: 9 | Indexed URLs in sitemap: 2”?




The day the routers died

Why the fuck do we dumb and clueless Internet marketers care about Google’s Toolbar PageRank when the Internet faces real issues? Well, both the toolbar slider and IPv4 are somewhat finite.

I can hear the IM crowd singing “The day green pixels died” … whilst Matt’s gang in building 43 intones “No mercy, smack paid links, no place to hide for TLA links” … Enjoy this video, it’s friggin’ hilarious:

[Video: Gary Feldman – The Day The Routers Died]

Since Gary Feldman’s song “The Day The Routers Died” will become an evergreen soon, I thought you might be interested in a transcript:

A long long time ago
I can still remember
When my laptop could connect elsewhere.

And I tell you all there was a day
The network card I threw away
Had a purpose and it worked for you and me.

But 18 years completely wasted
With each address we’ve aggregated
The tables overflowing
The traffic just stopped flowing.

And now we’re bearing all the scars
And all my traceroutes showing stars
The packets would travel faster in cars
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 makes me let out a sigh
But I spose we’d better give it a try
I suppose we’d better give it a try!

Now did you write an RFC
That dictated how we all should be
Did we listen like we should that day?

Now were you back at RIPE fifty-four
Where we heard the same things months before
And the people knew they’d have to change their ways.

And we knew that all the ISPs
Could be future proof for centuries.

But that was then not now
Spent too much time playing WoW.

Ooh there was time we sat on IRC
Making jokes on how this day would be
Now there’s no more use for TCP
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 just makes me let out a sigh
But I spose we’d better give it a try
I suppose we’d better give it a try!

I remember those old days I mourn
Sitting in my room, downloading porn
Yeah that’s how it used to be.

When the packets flowed from A to B
Via routers that could talk IP
There was data [that] could be exchanged between you and me.

Oh but I could see you all ignore
The fact we’d fill up IPv4!

But we all lost the nerve
And we got what we deserved!

And while we threw our network kit away
And wished we’d heard the things they say
Put all our lives in disarray
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 just makes me let out a sigh
But I spose we’d better give it a try
I suppose we’d better give it a try!

Saw a man with whom I used to peer
Asked him to rescue my career
He just sighed and turned away.

I went down to the ‘net cafe
That I used to visit everyday
But the man there said I might as well just leave.

[And] now we’ve all lost our purpose
My cisco shares completely worthless
No future meetings for me
At the Hotel Krasnapolsky.

And the men that make us push and push
Like Geoff Huston and Randy Bush
Should’ve listened to what they told us
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 just makes me let out a sigh
But I spose we’d better give it a try
[I suppose we’d better give it a try!]

Recorded at the RIPE:55 meeting in Amsterdam (NL) at the Krasnapolsky Hotel between 22 and 26 October 2007.

Just in case the video doesn’t load, here is another recording.




How to get the perfect logo for your blog

When I moved my blog from blogspot to this domain, using my avatar image (which I’ve dropped at tons of places) in the blog’s logo was a natural thing to do. It’s somewhat unique in my atmosphere, it helps folks to remember my first name, and branding with an image in a signal color like red fitted my marketing instincts.

Now the bad news. A few days later my 5yo daughter taught me that it was not exactly clever. She got a Disney video she wanted to watch with me, and (not so much to my surprise) my blog’s logo was one of the kingpins. Way back when I got the image from a freelance designer, I didn’t think about copyright issues, because I planned to use the image only as a thumbnail respectively icon connecting my name to a memorable picture (of the little mermaid’s crab Sebastian). The bigger version on top of all pages here, however, had way too many similarities with the Disney character.

Kinda dilemma. Reverting to a text logo was no option. Fortunately I click the home page links on all comments I get, so I remembered that not long ago a blogging cartoonist had submitted a note to one of my posts. I wrote an email telling him that I needed a new red crab, he replied with a reasonable quote, thus I ordered the logo design. Long story short, I was impressed by his professional attitude, and now you can admire his drawing skills at my blog’s header and footer as well.

Before I bore you with the longish story of my red crab, think of your (blog’s) branding. Do you have a unique logo? Is it compelling and memorable? If you put it on a page along with 100+ icons gathered from your usual hangouts, will its thumbnail stick out? Does it represent you, your business, your niche, or whatever you blog for? Do you brand yourself at all? Why not? Do it.

Look at a few very popular marketing blogs and follow the authors to their hangouts. You’ll spot that they make consistent use of their individual logos respectively avatars. That’s not coincidence, that’s professionalism. For the record, you can become a rockstar without a logo. If you’re Vanessa Fox or John Andrews you get away with frequently changing icons and even NSFW domain names. A conspicuous and witty logo makes branding easier, but a logo is not your brand.

Become creative, but please don’t use a red Disney character known as Sebastian as avatar, or red crabs at all, because I’ve trademarked that. Ok, if you can imagine a cartoonized logo might fit your blog, then read on.

As order confirmation, Steven from Clangnuts sent me his first ideas, asking whether he was on the right track or not. Indeed he was, and I liked it. Actually, I liked it a lot.

Shortly after my reply I got the next previews. Steven had nicely worked in my few wishes. He even showed me with an edited screenshot how the new red crab would look in my blog’s header, in variations (manually colored vs. photoshop colored). It looked great.

Finally he sent me four poses to choose from. Bummer. I liked all of them:
[Image: My red crab - 4 versions]
I picked the first, and got the colored version today. Thanks Steven, you did a great job! I hereby declare that when you need an outstanding logo for your blog, you’d better contact Steven at Clangnuts dot com before you fire up photoshop yourself.

What do you think, is #1 the best choice? Feel free to vote in the comments!




Share Your Sphinn Love!

Donna started a meme with a very much appreciated compliment - thanks, Donna! Like me, she discovered a lot of “new” folks at Sphinn and enjoyed their interesting blogs.

Savoring Sphinn comes with a duty, Donna thinks, so she appeals to share the love. She’s right. All of us benefit from Sphinn love; it’s only fair to spread it a little. However, picking only three people I’d never have come across without Danny’s newest donation to the Internet marketing community is a tough task. Hence I wrote a long numbered list and diced. Alea iacta est. Here are three of the many nice people I met at Sphinn:

  • Hamlet Batista: blog, feed, a post I like
  • Tadeusz Szewczyk: blog, feed, a post I like
  • Tinu Abayomi-Paul: blog, feed, a post I like

To those who didn’t make it on this list: That’s just kismet, not bad karma! I bet you’ll appear in someone’s share the sphinn love post in no time.

To you three: Get out your sphinn love post and choose three sphinners writing a feed-worthy blog, preferably people not yet featured elsewhere. I’ve subscribed to a couple feeds of blogs discovered at Sphinn, and so did you. There’s so much great stuff at Sphinn that you’re spoilt for choice.




How to bait link baiters and attention whores properly

What a brilliant marketing stunt. Click here! Err… click: Brilliant. Marketing. Stunt.

Best of luck John :)




Rediscover Google’s free ranking checker!

Nowadays we’re searching via toolbar, personalized homepage, or in the browser address bar by typing in “google” to get the search box, typing in a search query using “I feel lucky” functionality, or –my favorite– typing in google.com/search?q=free+pizza+service+nearby.

Old fashioned, uncluttered and nevertheless sexy user interfaces are forgotten, and pretty much disliked due to the lack of nifty rounded corners. Luckily Google still maintains them. Look at this beautiful SERP:
[Image: Google's free ranking checker]
It’s free of personalized search, wonderfully uncluttered because the snippets appear as tooltips only, results are nicely numbered from 1 to 1,000 on just 10 awesome fast-loading pages, and when I’ve visited my URLs before, I spot my purple rankings quickly.

http://google.com/ie?num=100&q=keyword1+keyword2 is an ideal free ranking checker. It supports &filter=0 and other URL parameters, so it’s a perfect tool when I need to lookup particular search terms.
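
For example (a made-up query), http://google.com/ie?num=100&q=red+crab+seo&filter=0 delivers up to 100 unfiltered results per page.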

Mass ranking checks are totally and utterly useless, at least for the average site, and penalized by Google. Well, I can think of ways to semi-automate a couple queries, but honestly, I almost never need that. Providing fully automated ranking reports to clients gave SEO services a more or less well-deserved snake oil reputation, because nice rankings for preselected keywords may be great ego food, but they don’t pay the bills. I admit that with some setups automated mass ranking checks make sense, but those are off-topic here.

By the way, Google’s query stats are a pretty useful resource too.



