Why storing URLs with truncated trailing slashes is utter idiocy

With some Web services, URL canonicalization has a downside. What works great for major search engines like Google can backfire when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog’s reputation at Technorati.

In fact the problem is not your URL canonicalization, e.g. 301 redirects from http://example.com to http://example.com/ or from http://example.com/directory to http://example.com/directory/, but crappy software that removes trailing forward slashes from your URLs.

Dear Web developers, if you really think that home page locations or directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “http://example.com” as anchor text makes any sense when the URL in HREF points to “http://example.com/”). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes. Many uses of invalid URLs can damage the search engine rankings of the link destinations. You can’t control that, hence don’t screw our URLs. Never. Period.

Folks who don’t agree with the above, read on.

    TOC:

  • What is a trailing slash? About URLs, directory URIs, default documents, directory indexes, …
  • How to rescue stolen trailing slashes About Apache’s handling of directory requests, and rewriting respectively redirecting invalid directory URIs in .htaccess as well as in PHP scripts.
  • Why stealing trailing slashes is not cool Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.
  • How URL canonicalization irritates Technorati 301 redirects that “add” a trailing slash to directory URLs, respectively virtual URIs that mimic directories, seem to irritate Technorati so much that it can’t compute reputation, recent post lists, and so on.

What is a trailing slash?

The Web’s standards say (links and full quotes): The trailing path segment delimiter “/” represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite “should” as “must”.)

To understand that, let’s look at the most common URL components:
scheme:// server-name.tld /path ?query-string #fragment
The path part begins with a forward slash “/” and must consist of at least one byte (the trailing slash itself in case of the home page URL http://example.com/).

If an URL ends with a slash, it points to a directory’s default document, or, if there’s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because “/” after the TLD (.com|net|org|…) stands for the root directory.

Automated directory indexes (a list of links to all files) should be forbidden; use Options -Indexes in .htaccess to send such requests to your 403-Forbidden page.

In order to set default file names and their search sequence for your directories use DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php. In this example: on request of http://example.com/directory/ Apache will first look for /directory/index.html, then if that doesn’t exist for /directory/index.htm, then /directory/index.php, and if all that fails, it will serve an error page (that should log such requests so that the Webmaster can upload the missing default document to /directory/).
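Putting the two directives together, a minimal .htaccess sketch might look like this (the error handler path is just the example from above, adjust it to your setup):
# no auto-generated directory listings
Options -Indexes
# default documents, tried in this order, with an error handler as last resort
DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php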

The URL http://example.com (without the trailing slash) is invalid, and there’s no specification telling a reason why a Web server should respond to it with meaningful contents. Actually, the location http://example.com points to Null  (nil, zilch, nada, zip, nothing), hence the correct response is “404 - we haven’t got ‘nothing to serve’ yet”.

The same goes for sub-directories. If there’s no file named “/dir”, the URL http://example.com/dir points to Null too. If you’ve a directory named “/dir”, the canonical URL http://example.com/dir/ either points to a directory index page (an autogenerated list of all files) or the directory’s default document “index.(html|htm|shtml|php|…)”. A request for http://example.com/dir (without the trailing slash that tells the Web server the request is for a directory’s index) resolves to “not found”.

You must not reference a default document by its name! If you’ve links like http://example.com/index.html you can’t change the underlying technology without serious hassles. Say you’ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you’ll realize that static stuff sucks, hence you’ll develop a dynamic site with PHP. You’ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you’ve coded your internal links as http://example.com/contact/ etc. they’ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from “… index.html … index.php …” to “… index.php … index.html …”. (Of course you can configure Apache to parse .html files for PHP code, but that’s another story.)

It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of “index” files (intentionally or not). I’d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes.

How to rescue stolen trailing slashes

Since Web servers know that users are faulty by design, they jump through a couple of resource burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values …) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar.

With Apache, DirectorySlash On enables this behavior (check whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules:
RewriteEngine on
RewriteBase /content/
RewriteRule ^dir1$ http://example.com/content/dir1/ [R=301,L]
RewriteRule ^dir2$ http://example.com/content/dir2/ [R=301,L]
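To check whether your server answers slashless directory requests with a 301 or a 302 (see the parenthesis above), a quick and dirty PHP snippet like the one below does the trick; http://example.com/dir is a placeholder, test with one of your own directory URLs:
<?php
// fetch only the HTTP headers for a directory URL without the trailing slash
$headers = get_headers("http://example.com/dir");
// the first array element is the status line, e.g. "HTTP/1.1 301 Moved Permanently"
echo $headers[0];
?>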

With content management systems (CMS) that generate virtual URLs on the fly, often there’s no other chance than hacking the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, you’ll perform permanent redirects (301).

Here is a WordPress (header.php) example:
$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$doRedirect = FALSE;
$fileExtensions = array(".html", ".htm", ".php");
$serverName = $_SERVER["SERVER_NAME"];
$canonicalServerName = $serverName;
 
// if you prefer http://example.com/* URLs remove the "www.":
$srvArr = explode(".", $serverName);
$canonicalServerName = $srvArr[count($srvArr) - 2] . "." . $srvArr[count($srvArr) - 1];
 
$url = parse_url("http://" . $canonicalServerName . $requestUri);
$requestUriPath = $url["path"];
if (substr($requestUriPath, -1, 1) != "/") {
    // add the trailing slash unless the request is for a file
    $isFile = FALSE;
    foreach ($fileExtensions as $fileExtension) {
        if (strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension)) {
            $isFile = TRUE;
        }
    }
    if (!$isFile) {
        $requestUriPath .= "/";
        $doRedirect = TRUE;
    }
}
$canonicalUrl = "http://" . $canonicalServerName . $requestUriPath;
if ($queryString) {
    $canonicalUrl .= "?" . $queryString;
}
if (!empty($url["fragment"])) {
    $canonicalUrl .= "#" . $url["fragment"];
}
if ($doRedirect) {
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit;
}

Check your permalink settings and edit the values of $fileExtensions and $canonicalServerName accordingly. For other CMSs adapt the code; perhaps you need to change the handling of query strings and fragments. The code above will not run under IIS, because it has no REQUEST_URI variable.
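If you must run something like that under IIS, a common workaround is to fake the missing variable before the canonicalization code executes. This is just a rough sketch and assumes no URL rewriting is in play (otherwise SCRIPT_NAME may differ from the requested URL):
<?php
// IIS doesn't populate REQUEST_URI, so rebuild it from SCRIPT_NAME and QUERY_STRING
if (!isset($_SERVER["REQUEST_URI"])) {
    $_SERVER["REQUEST_URI"] = $_SERVER["SCRIPT_NAME"];
    if (!empty($_SERVER["QUERY_STRING"])) {
        $_SERVER["REQUEST_URI"] .= "?" . $_SERVER["QUERY_STRING"];
    }
}
?>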

Why stealing trailing slashes is not cool

This section expressed in one sentence: Cool URLs don’t change, hence changing other people’s URLs is not cool.

Folks should understand the “U” in URL as unique. Each URL addresses one and only one particular resource. Technically speaking, if you change one single character of an URL, the altered URL points to a different resource, or nowhere.

Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.

Only a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren’t much different. If somebody hands out an URL (with trailing slash), then use it as is, or don’t use it at all. Don’t “prettify” it, because any change destroys its serviceability.

If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (browser, screen reader, bot) with a redirect header telling it to use a trailing slash; the user agent then has to re-issue the request in the formally correct way. From a Webmaster’s perspective, burning resources that thoughtlessly is plain theft. From a user’s perspective, things will often work without the slash, but they’ll be quicker with it. “Often” doesn’t equal “always”:

  • Some Web servers will serve the 404 page.
  • Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.
  • Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.
  • Some Web servers will serve identical contents for /dir and /dir/, which leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to all known URL variants, instead of accumulated rankings to both URLs (which would be the right thing to do, but is technically, well, challenging).
  • Some user agents can’t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect’s “error message”, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang up or process garbage.

Does it really make sense to maliciously manipulate URLs just because some clueless developers say “dude, without the slash it looks way cooler”? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach.

KISS (keep it simple, stupid) is a great principle. “Cosmetic corrections” like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don’t change their inputs. As long as the input is processible, they accept it, otherwise they skip it.

Exceptions

URLs in print, on radio, and offline in general should be truncated in a way that browsers can figure out the location - “domain.co.uk” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than losing a visitor who won’t type in the canonical URL including scheme, www prefix, and trailing slash.

How URL canonicalization seems to irritate Technorati

Due to the not exactly responsive (or rather swamped) Technorati user support, parts of this section should be interpreted as educated speculation. Also, I didn’t research enough cases to come to a working theory. So here is just the story of “how Technorati fails to deal with my blog”.

When I moved my blog from blogspot to this domain, I enhanced the faulty WordPress URL canonicalization. If any user agent requests http://sebastians-pamphlets.com it gets redirected to http://sebastians-pamphlets.com/. Invalid post/page URLs like http://sebastians-pamphlets.com/about redirect to http://sebastians-pamphlets.com/about/. All redirects are permanent, returning the HTTP response code 301.

I’ve claimed my blog as http://sebastians-pamphlets.com/, but Technorati shows its URL without the trailing slash.
…<div class="url"><a href="http://sebastians-pamphlets.com">http://sebastians-pamphlets.com</a> </div> <a class="image-link" href="/blogs/sebastians-pamphlets.com"><img …

By the way, they forgot dozens of fans (folks who “fave’d” either my old blogspot outlet or this site) too.
[Screenshot: Blogs claimed at Technorati]

I’ve added a description and tons of tags, but neither shows up on public pages. It seems my tags were deleted; at least they aren’t visible in edit mode any more.
[Screenshot: Edit blog settings at Technorati]

Shortly after the submission, Technorati stopped adjusting the reputation score for newly discovered inbound links. Furthermore, the list of my recent posts became stale, although I’ve pinged Technorati with every update, and Technorati received my update notifications via ping services too. And yes, I’ve tried manual pings to no avail.

I’ve gained lots of fresh inbound links, but the authority score didn’t change. So I asked Technorati’s support for help. A few weeks later, in December 2007, I got an answer:

I’ve taken a look at the issue regarding picking up your pings for “sebastians-pamphlets.com”. After making a small adjustment, I’ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.

Please let us know if you experience any problems in the future. Do not hesitate to contact us if you have any other questions.

Indeed, Technorati updated the reputation score from 56 to 191, and refreshed the list of posts including the most recent one.

Of course the “small adjustment” didn’t persist (I assume that a batch process stole the trailing slash that the friendly support person had added). I’ve sent a follow-up email asking whether that’s a slash issue or not, but haven’t received a reply yet. I’m quite sure that Technorati doesn’t follow 301-redirects, so that’s a plausible cause for this bug at least.

Since December 2007 Technorati hasn’t updated my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the reactions page - by the way, these numbers are often unreal and change in the range of hundreds from day to day).
[Screenshot: Blog reactions and authority scoring at Technorati]

It seems Technorati hasn’t indexed my posts since then (December 18, 2007), so probably my outgoing links don’t count for their destinations.
[Screenshot: Stale list of recent posts at Technorati]

(All screenshots were taken on February 5, 2008. When you click the Technorati links today, they will hopefully look different.)

I’m not amused. I’m curious what would happen when I add
if (!preg_match("/Technorati/i", "$userAgent")) {/* redirect code */}

to my canonicalization routine, but I can resist handling particular Web robots. My URL canonicalization should be identical for visitors and crawlers. Technorati should be able to fix this bug without code changes at my end or weekly support requests. Wishful thinking? Maybe.

Update 2008-03-06: Technorati crawls my blog again. The 301 redirects weren’t the issue. I’ll explain that in a follow-up post soon.




The hacker tool MSN-LiveSearch is responsible for brute force attacks

A while ago I staged a public SEO contest, asking whether the 401 HTTP response code prevents search engine indexing or not.

Password protected site areas should be safe from indexing, because legit search engine crawlers do not submit user/password combos. Hence their attempts to fetch a password protected URL bounce with a 401 HTTP response code that translates to a polite “Authorization Required”, meaning “Forbidden unless you provide valid authorization”.
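For reference, such a password protected area is typically set up with a few lines of .htaccess like these (the realm name and path are made up; the .htpasswd file belongs outside the Web root):
AuthType Basic
AuthName "Members area"
AuthUserFile /home/example/.htpasswd
Require valid-user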

Life experience and common sense tell search engines that when a Webmaster protects content with a user/password query, this content is not available to the public. Search engines that respect Webmasters and site owners do not point their users to protected content.

Also, it makes no sense for the search engine. Searchers submitting a query with keywords that match a protected URL would be pissed when they click the promising search result on the SERP, but the linked site responds with an unfriendly “Enter user and password in order to access [title of the protected area]”, which resolves to a harsh error message because the searcher can’t provide such information, and usually can’t even sign up from the 401 error page¹.

Unfortunately, search results that contain URLs of password protected content are valuable tools for hackers. Many content management systems and payment processors that Webmasters use to protect and monetize their contents leave footprints in URLs, for example /members/. Even when those systems can handle individual URLs, many Webmasters leave default URLs in place that are either guessable or well known on the Web.

Developing a script that searches for a string like /members/ in URLs and then “tests” the search results with brute force attacks is a breeze. Also, such scripts are available (for a few bucks or even free) at various places. Without the help of a search engine that provides the lists of protected URLs, the hacker’s job is way more complicated. In other words, search engines that list protected URLs on their SERPs willingly support and encourage hacking, content theft, and DOS-like server attacks.

Ok, let’s look at the test results. All search engines have cast their votes now. Here are the winners:

Google :)

Once my test was out, Matt Cutts from Google researched the question and told me:

My belief from talking to folks at Google is that 401/forbidden URLs that we crawl won’t be indexed even as a reference, so .htacess password-protected directories shouldn’t get indexed as long as we crawl enough to discover the 401. Of course, if we discover an URL but didn’t crawl it to see the 401/Forbidden status, that URL reference could still show up in Google.

Well, that’s exactly the expected behavior, and I wasn’t surprised that my test results confirm Matt’s statement. Thanks to Google’s BlitzIndexing™, Ms. Googlebot spotted the 401 so fast that the URL never showed up on Google’s SERPs. Google reports the protected URL in my Webmaster Console account for this blog as not indexable.

Yahoo :)

Yahoo’s crawler Slurp also fetched the protected URL in no time, and Yahoo did the right thing too. I wonder whether or not that’s going to change if M$ buys Yahoo.

Ask :)

Ask’s crawler isn’t the most diligent Web robot out there. However, somehow Ask has managed not to index a reference to my password protected URL.

And here is the ultimate loser:

MSN LiveSearch :(

Oh well. Obviously MSN LiveSearch is a must have in a deceitful cracker’s toolbox:

[Screenshot: MSN LiveSearch indexes password protected URLs]

As if indexing references to password protected URLs weren’t crappy enough, MSN even indexes sitemap files that are referenced in robots.txt only. Sitemaps are machine readable URL submission files that have absolutely no value for humans. Webmasters make use of sitemap files to mass submit their URLs to search engines. The sitemap protocol, which MSN officially supports, defines a communication channel between Webmasters and search engines - not searchers, and especially not scrapers that can use indexed sitemaps to steal Web contents more easily. Here is a screen shot of an MSN SERP:

[Screenshot: MSN LiveSearch indexes unlinked sitemaps files (MSN SERP)]
[Screenshot: MSN LiveSearch indexes unlinked sitemaps files (MSN Webmaster Tools)]
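For the record, the reference MSN picked up is just a single line in robots.txt, as defined by the sitemaps protocol (example.com stands for your domain):
Sitemap: http://example.com/sitemap.xml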

All the other search engines got the sitemap submission of the test URL too, but none of them fell for it. Neither Google, Yahoo, nor Ask have indexed the sitemap file (they never index submitted sitemaps that have no inbound links by the way) or its protected URL.

Summary

All major search engines except MSN respect the 401 barrier.

Since MSN LiveSearch is well known for spamming, it’s not a big surprise that they support hackers, scrapers and other content thieves.

Of course MSN search is still an experiment, operating in a not yet ready to launch stage, and the big players made their mistakes in the beginning too. But MSN has a history of ignoring Web standards as well as Webmaster concerns. It took them two years to implement the pretty simple sitemaps protocol, they still can’t handle 301 redirects, their sneaky stealth bots spam the referrer logs of all Web sites out there in order to fake human traffic from MSN SERPs (MSN traffic doesn’t exist in most niches), and so on. Once pointed to such crap, they don’t even fix the simplest bugs in a timely manner. I mean, not complying with the HTTP 1.1 protocol from the last century is evidence of incapacity, and that’s just one example.

 

Update Feb/06/2008: Last night I received an email from Microsoft confirming the 401 issue. The MSN Live Search engineer said they are currently working on a fix, and he provided me with an email address to report possible further issues. Thank you, Nathan Buggia! I’m still curious how MSN Live Search will handle sitemap files in the future.

 


1 Smart Webmasters provide sign up as well as login functionality on the page referenced as ErrorDocument 401, but the majority of all failed logins leave the user alone with the short hard coded 401 message that Apache outputs if there’s no 401 error document. Please note that you shouldn’t use a PHP script as 401 error page, because this might disable the user/password prompt (due to a PHP bug). With a static 401 error page that fires up on invalid user/pass entries or a hit on the cancel button, you can perform a meta refresh to redirect the visitor to a signup page. Bear in mind that in .htaccess you must not use absolute URLs (http://… or https://…) in the ErrorDocument 401 directive, and that on the error page you must use absolute URLs for CSS, images, links and whatnot because relative URIs don’t work there!
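In .htaccess that boils down to one directive; note the relative path, per the warning above, and the file name is just an example:
# static page, not a PHP script; a relative URL is mandatory for ErrorDocument 401
ErrorDocument 401 /errors/401.html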




Comment rating and filtering with SezWho

I’ve added SezWho to the comment area. SezWho enables rating and filtering of comments, and shows you even comments an author has left on other blogs. Neat.

Currently there are no ratings, so the existing comments are all rated 2.5 (quite good). Once you’ve rated a few comments, you can suppress all lower quality comments (rated below 3), or show high quality comments only (rated 4 or better).

Don’t freak out when you use CSS that highlights nofollow’ed links. SezWho manipulates the original (mostly dofollow’ed) author link with JavaScript, hence search engines still recognize that a link shall pass PageRank and anchor text. (I condomize some link drops, for example when I don’t know a site and can’t afford the time to check it out, see my comments policy.)

I’ll ask SezWho to change that when I’m more familiar with their system (I hate change requests based on a first peek). SezWho should look at the attributes of the original link in order to add rel=”nofollow” to the JS created version only when the blogger actually has condomized a particular link. Their software changes the comment author URL to a JS script that redirects visitors to the URL the commenter has submitted. It would be nice to show the original URL in the status bar on mouse over.

Also, it seems that when you sign up with SezWho, they remove the trailing slash from your blog’s URL. That’s not acceptable. I mean, not every startup should do what clueless Yahoo developers still do although they know that it violates several Web standards. Removing trailing slashes from links is not cool; it’s a crappy manipulation that can harm search engine rankings, will lead to bandwidth theft when bots follow castrated links only to get redirected, … ok, ok, ok … that’s stuff for another post rant. Judging from their Web site, SezWho looks like a decent operation, so I’m sure they can change that too.

 

[Image: SezWho sidebar widget]

I’ve not yet added the widgets; above is how they would appear in the sidebar.

 

I consider SezWho useful. All functionality lives in the blog and can access the blog’s database, so in theory it doesn’t slow down the page load time by pulling loads of data from 3rd party sources. Please let me know whether you like it or not. Thanks!




Sorry Aaron Wall - I fucked up

My somewhat sarcastic post “Avoiding the well known #4 penalty“, where I joked about a possible Google #6 filter and criticized the SEO/Webmaster community for invalid methods of dealing with SERP anomalies, reads like “Aaron Wall is a clueless douche-bag”. Of course that’s not true, I never thought that, and I apologize for damaging Aaron’s reputation so thoughtlessly.

To express that I believe Aaron is a smart and very nice guy, I link to his related great post about things SEOs can learn from search engine bugs and glitches:

Do You Care About Google Glitches?     Excerpt:

Glitches reveal engineer intent. And they do it early enough that you have time to change your strategy before your site is permanently filtered or banned. When you get to Google’s size, market share, and have that much data, glitches usually mean something.

To make my point clear: calling a SERP anomaly a filter or penalty unless its intents and causes are properly analyzed, and this analysis is backed up with a reasonable data set, is as thoughtless as damaging a fellow SEO’s reputation in a way that someone new to the field reading my post and/or comments at Sphinn must think that I’m poking Aaron, although I’m just sick of the almost daily WMW penalty inventions (WMW members -not Aaron!- invented the “Google position #6 penalty / filter” term). The sole reason for mentioning Aaron in my post was that his post (also read this one) triggered a great discussion at Sphinn that I’ve cited in parts.




Google removes the #6 penalty/filter/glitch

After the great #6 Penalty SEO Panel, Google’s head of the webspam dept. Matt Cutts dug out a misbehaving algo and sent it back to the developers. Two hours ago he stated:

When Barry asked me about “position 6″ in late December, I said that I didn’t know of anything that would cause that. But about a week or so after that, my attention was brought to something that could exhibit that behavior.

We’re in the process of changing the behavior; I think the change is live at some datacenters already and will be live at most data centers in the next few weeks.

 

So everything is fine now. Matt penalizes the position-six software glitch, and lost top positions will revert to their former rankings in a while. Well, not really. Nobody will compensate for the income losses, nor for the time Webmasters spent on forums discussing a suspected penalty that actually was a bug or a weird side effect. However, kudos to Google for listening to concerns, tracking down and fixing the algo. And thanks for the update, Matt.




Avoiding the well known #4 SERP-hero-penalty …

… I just have to link to North South Media’s neat collection of Search Action Figures.

Paul pretty much dislikes folks who don’t link to him, so Danny Sullivan and Rand Fishkin are well advised to drop a link every now and then, and David Naylor had better give him an interview slot asap. ;)

Google’s numbered “penalties”, esp. #6

As for numeric penalties in general … repeat("Sigh", ) … enjoy this brains trust moderated by Marty Weintraub (unauthorized):

Marty: Folks, please welcome Aaron Wall, who recently got his #6 penalty removed!

Audience: clap(26) sphinn(26)

The Gypsy: Sorry Marty but come on… this is complete BS and there is NO freakin #6 filter just like the magical minus 90…900 bla bla bla. These anomalies NEVER have any real consensus on a large enough data set to even be considered a viable theory.

A Red Crab: As long as Bill can’t find a plus|minus-n-raise|penalty patent, or at least a white paper or so leaked out from Google, or for all I care a study that provides proof instead of weird assumptions based on claims of webmasters jumping on today’s popular WMW bandwagon that are neither plausible nor verifiable, such beasts don’t exist. There are unexplained effects that might look like a pattern, but in most cases it makes no sense to gather a few examples coming with similarities, because we’ll never reach the critical mass of anomalies to discuss a theory worth more than a thumbs-down click.

Marty: Maybe Aaron is joking. Maybe he thinks he has invented the next light bulb.

Gamermk: Aaron is grasping at straws on this one.

Barry Welford: I would like this topic to be seen by many.

Audience: clap(29) sphinn(29)

The Gypsy: It is just some people that have DECIDED on an end result and trying to make various hypothesis fit the situation (you know, like tobacco lobby scientists)… this is simply bad form IMO.

Danny Sullivan: Well, I’ve personally seen this weirdness. Pages that I absolutely thought “what on earth is that doing at six” rather than at the top of the page. Not four, not seven — six. It was freaking weird for several different searches. Nothing competitive, either.

I don’t know that sixth was actually some magic number. Personally, I’ve felt like there’s some glitch or problem with Google’s ranking that has prevented the most authorative page in some instances from being at the top. But something was going on.

Remember, there’s no sandbox, either. We got that for months and months, until eventually it was acknowledge that there were a range of filters that might produce a “sandbox like” effect.

The biggest problem I find with these types of theories is they often start with a specific example, sometimes that can be replicated, then they become a catch-all. Not ranking. Oh, it’s the sandbox. Well no — not if you were an established site, it wasn’t. The sandbox was typicaly something that hit brand new sites. But it became a common excuse for anything, producing confusion.

Jim Boykin: I’ll jump in and say I truely believe in the 6 filter. I’ve seen it. I wouldn’t have believed it if I hadn’t seen it happen to a few sites.

Audience: clap(31) sphinn(31)

A Red Crab: Such terms tend to take on a life of their own, IOW an excuse for nearly every way a Webmaster can fuck up rankings. Of course Google’s query engine has thresholds (yellow cards or whatever they call them) that don’t allow some sites to rank above a particular position, but that’s a symptom that doesn’t allow back-references to a particular cause, or causes. It’s speculation as long as we don’t know more.

IncrediBill: I definitely believe it’s some sort of filter or algo tweak but it’s certainly not a penalty which is why I scoff at calling it such. One morning you wake up and Matt has turned all the dials to the left and suddenly some criteria bumps you UP or DOWN. Sites have been going up and down in Google SERPs for years, nothing new or shocking about that and this too will have some obvious cause and effect that could probably be identified if people weren’t using the shotgun approach at changing their site

G1smd: By the time anyone works anything out with Google, they will already be in the process of moving the goalposts to another country.

Slightly Shady SEO: The #6 filter is a fallacy.

Old School: It certainly occured but only affected certain sites.

Danny Sullivan: Perhaps it would have been better called a -5 penalty. Consider. Say Google for some reason sees a domain but decides good, but not sure if I trust it. Assign a -5 to it, and that might knock some things off the first page of results, right?

Look — it could all be coincidence, and it certainly might not necessarily be a penalty. But it was weird to see pages that for the life of me, I couldn’t understand why they wouldn’t be at 1, showing up at 6.

Slightly Shady SEO: That seems like a completely bizarre penalty. Not Google’s style. When they’ve penalized anything in the past, it hasn’t been a “well, I guess you can stay on the frontpage” penalty. It’s been a smackdown to prove a point.

Matt Cutts: Hmm. I’m not aware of anything that would exhibit that sort of behavior.

Audience: Ugh … oohhhh … you weren’t aware of the sandbox, either!

Danny Sullivan: Remember, there’s no sandbox, either. We got that for months and months, until eventually it was acknowledge that there were a range of filters that might produce a “sandbox like” effect.

Audience: Bah, humbug! We so want to believe in our lame excuses …

Tedster: I’m not happy with the current level of analysis, however, and definitely looking for more ideas.

Audience: clap(40) sphinn(40)


Of course the panel above is fictional, assembled from snippets which in some cases change their message when you read them in their original context. So please follow the links.

I wouldn’t go so far as to say there’s no such thing as a fair amount of Web pages that deserve a #1 spot on Google’s SERPs but rank #6 for unknown reasons (perhaps link monkey business, staleness, PageRank flow in disarray, anchor text repetitions, …). There’s something worth investigating.

However, I think that labelling a discussion of glitches, or maybe filters that don’t behave, based on a way too tiny dataset a “#6 penalty” leads to the “lame excuse for literally anything” phenomenon.

Folks who don’t follow the various threads closely enough to spot the highly speculative character of the beast will take it as fact and switch to winter sleep mode instead of enhancing their stuff like Aaron did. I can’t wait for the first “How to escape the Google -5 penalty” SEO tutorial telling the great unwashed that a “+5” revisit-after meta tag will heal it.




Dealing with spamming content thieves / plagiarists (oylinki.com)

When it comes to crap like plagiarism you shouldn’t consider me a gentleman.

If assclowns like Veronica Domb steal my content and publish it along with likewise stolen comments on their blatantly spamming site oylinki.com, I’m somewhat upset.

Then, when I leave a polite note asking the thief Veronica Domb from EmeryVille to remove my stuff asap, see my comment marked as “in moderation”, but neither my content gets removed nor my comment published within 24 hours, I stay annoyed.

When I’m annoyed, I write blog posts like this one. I’m sure it will rank high enough for [Veronica Domb] when the assclown’s banker or taxman searches for her name. I’m sure it’ll be visible on any SERP that any other (potential) business partner submits at a major search engine.

Content Thieves Veronica Domb et al, P.O.BOX 99800, EmeryVille, 94662, CA are blatant spammers

Hey, outing content thieves is way more fun than filing boring DMCA complaints, and way more effective. Plagiarists do ego searches too, and from now on Veronica Domb from EmeryVille will find the footsteps of her criminal activities on the Web with each and every ego search. Isn’t that nice?

Not. Of course Veronica Domb is a pseudonym of Slade Kitchens, Jamil Akhtar, … However, some plagiarists and scam artists aren’t smart enough to hide their identity, so watch out.

Maybe I’ve done some companies a little favor, because they certainly don’t need to send out money sneakily “earned” with Web spam and criminal activities that violate the TOS of most affiliate programs.

AdBrite will love to cancel the account for these affiliate links:
http://ads.adbrite.com/mb/text_group.php?sid=448245&br=1&dk=736d616c6c20627573696e6573735f355f315f776562
http://www.adbrite.com/mb/commerce/purchase_form.php?opid=448245&afsid=1

Google’s webspam team as well as other search engines will most likely delist oylinki.com that comes with 100% stolen text and links and faked whois info as well.

Spamcop and the like will happily blacklist oylinki.com (IP: 66.199.174.80, cwh2.canadianwebhosting.com) because the assclown’s blog software sends out email spam masked as trackbacks.

If anybody is interested, here’s a track of the real “Veronica Domb” from Canada clicking the link to this post from her WP admin panel:
74.14.107.36 - - [21/Jan/2008:07:50:40 -0500] "GET /outing-plagiarist-2008-01-21/ HTTP/1.1" 200 9921 "http://oylinki.com/blog/wp-admin/edit-comments.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SU 3.005; .NET CLR 1.1.4322; InfoPath.1; Alexa Toolbar; .NET CLR 2.0.50727)"

Common sense is not as common as you think.

Disclaimer: I’ve outed plagiarists in the past, because it works. Whether you do that on ego-SERPs or not depends on your ethics. Some folks think that’s even worse than theft and spamming. I say that publishing plagiarisms in the first place deserves bad publicity.




Get a grip on the Robots Exclusion Protocol (REP)

Thanks to the very nice folks over at SEOmoz I was able to prevent this site from becoming a kind of REP/robots.txt blog. Please consider reading this REP round up:

Robots Exclusion Protocol 101

My REP 101 links to the various standards (robots.txt, REP tags, Sitemaps, microformats) the REP consists of, and provides a rough summary of each REP component. It explains the difference between crawler directives and indexer directives, and which command hierarchy search engines follow when REP directives put at different levels conflict.

Why do I think that solid REP knowledge is important right now? Not only because of the confusion that exists thanks to the volume of crappy advice provided at every Webmaster hangout. Of course understanding the REP makes webmastering easier, thus I’m glad when my REP related pamphlets are considered somewhat helpful.

I’ve a hidden agenda, though. I predict that the REP is going to change shortly. As usual, its evolution is driven by a major search engine, since the W3C and such organizations don’t bother with the conglomerate of quasi standards and RFCs known as the Robots Exclusion Protocol. In general that’s not a bad thing. Search engines deal with the REP every day, so they have a legitimate interest.

Unfortunately not every REP extension that search engines have invented so far is useful for Webmasters, some of them are plain crap. Learning from fiascos and riots of the past, the engines are well advised to ask Webmasters for feedback before they announce further REP directives.

I’ve a feeling that shortly a well known search engine will launch a survey regarding particular REP related ideas. I want Webmasters to be well aware of the REP’s complexity and functionality when they contribute their take on REP extensions. So please educate yourself. :)

My pamphlet discussing a possible standardization of REP tags as robots.txt directives could be a useful reference, also please watch the great video here. ;)




Do search engines index references to password protected smut?

Recently Matt Cutts said that Google doesn’t index password protected content. I wasn’t sure whether or not that goes for all search engines. I thought that they might index at least references to protected URLs, like they all do with other uncrawlable content that has strong inbound links.

Well, SEO tests are dull and boring, so I thought I could have some fun with this one.

I’ve joked that I should use someone’s favorite smut collection to test it. Unfortunately, nobody was willing to trade porn passwords for link love or so. I’m not a hacker, hence I’ve created my own tiny collection of password protected SEO porn (this link is not exactly considered safe at work) as test case.

I was quite astonished that according to this post about SEO porn next to nobody in the SEOsphere optimizes adult sites (of course that’s not true). From the comments I figured that some folks at least surf for SEO porn, er, evaluate the optimization techniques applied by adult Webmasters.

Ok, let’s extend that. Out yourself as an SEO porn savvy Internet marketer. Leave your email addy in the comments (don’t forget to tell me why I should believe that you’re over 18), and I’ll email you the super secret password for my SEO porn members area (!SAW). Trust me, it’s worth it, and perfectly legit due to the strictly scientific character of this experiment. If you’re somewhat shy, use a funny pseudonym.

I’d very much appreciate a little help with linkage too. Feel free to link to http://sebastians-pamphlets.com/porn/ with an adequate anchor text of your choice, and of course without condom.

Get the finest SEO porn available on this planet!

I’ve got the password, now let me in!




Getting URLs outta Google - the good, the popular, and the definitive way

There’s more and more robots.txt talk in the SEOsphere lately. That’s a good thing in my opinion, because the good old robots.txt’s power is underestimated. Unfortunately it’s quite often misused or even abused too, usually because folks don’t fully understand the REP (by following “advice” from forums instead of reading the real thing, or at least my stuff).

I’d like to discuss, from three angles, the REP’s capabilities that are supposed to make sure that Google doesn’t index particular contents:

The good way
If the major search engines supported new robots.txt directives that Webmasters really need, removing even huge chunks of content from Google’s SERPs -without collateral damage- via robots.txt would be a breeze.
The popular way
Shamelessly stealing Matt’s official advice [Source: Remove your content from Google by Matt Cutts]. To obscure the blatant plagiarism, I’ll add a few thoughts.
The definitive way
Of course that’s not the ultimate way, but that’s the way Google’s cookies crumble, currently. In other words: Google is working on a leaner approach, but that’s not yet announced, thus you can’t use it; you still have to jump through many hoops.

The good way

Caution: Don’t implement code from this section; the robots.txt directives discussed here are not (yet/fully) supported by search engines!

Currently all robots.txt statements are crawler directives. That means that they can tell behaving search engines how to crawl a site (fetching contents), but they’ve no impact on indexing (listing contents on SERPs). I’ve recently published a draft discussing possible REP tags for robots.txt. REP tags are indexer directives known from robots meta tags and X-Robots-Tags, which, as on-page respectively per-URL directives, require crawling.

The crux is that REP tags must be assigned to URLs. Say you’ve a gazillion printer friendly pages in various directories that you want to deindex at Google; putting the “noindex,follow,noarchive” tags on all of them comes with a shitload of work.

How cool would this robots.txt code be instead:
Noindex: /*printable
Noarchive: /*printable

Search engines would continue to crawl, but would deindex previously indexed URLs and not index new URLs from
/articles/printable/*.htm
/manuals/printable/*.pdf
/products/descriptions/*.php?format=printable&product=*
...

provided those URLs aren’t disallow’ed. They would follow the links in those documents, so that PageRank gathered by printer friendly pages wouldn’t be completely wasted. To apply an implicit rel-nofollow to all links pointing to printer friendly pages, so that those can’t accumulate PageRank from internal or external links, you’d add
Norank: /*printable

to the robots.txt code block above.

If you don’t like that search engines index stuff you’ve disallow’ed in your robots.txt from 3rd party signals like inbound links, and that Google accumulates even PageRank for disallow’ed URLs, you’d put:
Disallow: /unsearchable/
Noindex: /unsearchable/
Norank: /unsearchable/

To fix URL canonicalization issues with PHP session IDs and other tracking variables you’d write for example
Truncate-variable sessionID: /

and that would fix the duplicate content issues as well as the problem with PageRank accumulated by throw-away URLs.

Unfortunately, robots.txt is not yet that powerful, so please link to the REP tags for robots.txt “RFC” to make it popular, and proceed with what you have at the moment.

The popular way

Matt Cutts was kind enough to discuss Google’s take on contents excluded from search engine indexing in 10 minutes or less here:

You really should listen, the video isn’t that long.

In the following I’ve highlighted a few methods Matt has talked about:

Don’t link (very weak)
Although Google usually doesn’t index unlinked stuff, this can happen due to crawling based on sitemaps. Also, the URL might appear in linked referrer stats on other sites that are crawlable, and folks can link from the cold.
.htaccess / .htpasswd (Matt’s first recommendation)
Since Google cannot crawl password protected contents, Matt declares this method to prevent content from indexing safe. I’m not sure what will happen when I spread a few strong links to somebody’s favorite smut collection, perhaps I’ll test some day whether Google and other search engines list such a reference on their SERPs.
robots.txt (weak)
Matt rightly points out that Google’s cool robots.txt validator in the Webmaster Console is a great tool to develop, test and deploy proper robots.txt syntax that effectively blocks search engine crawling. The weak point is that even when search engines obey robots.txt, they can index uncrawled content from 3rd party sources. Matt is proud of Google’s smart capabilities to figure out suitable references like the ODP. I agree totally and wholeheartedly. Hence robots.txt in its current shape doesn’t prevent content from showing up in Google and other engines as well. Matt didn’t mention Google’s experiments with Noindex: support in robots.txt, which need improvement but could resolve this dilemma.
Robots meta tags (Google only, weak with MSN/Yahoo)
The REP tag “noindex” in a robots meta element prevents indexing, and, once spotted, deindexes previously listed stuff - at least at Google. According to Matt, Yahoo and MSN still list such URLs as references without snippets. Because only Google obeys “noindex” totally by wiping out even URL-only listings and foreign references, robots meta tags should be considered a kinda weak approach too. Also, search engines must crawl a page to discover this indexer directive. Matt adds that robots meta tags are problematic, because they’re buried on the pages and sometimes tend to get forgotten when no longer needed (Webmasters might forget to take the tag down, respectively add it later on when search engine policies change, or work in progress gets released respectively outdated contents are taken down). Matt forgot to mention the neat X-Robots-Tags that can be used to apply REP tags in the HTTP header of non-HTML resources like images or PDF documents (see the sketch after this list). Google’s X-Robots-Tag is supported by Yahoo too.
Rel-nofollow (kind of weak)
Although condoms totally remove links from Google’s link graphs, Matt says that rel-nofollow should not be used as crawler or indexer directive. Rel-nofollow is for condomizing links only, also other search engines do follow nofollow’ed links and even Google can discover the link destination from other links they gather on the Web, or grab from internal links inadvertently lacking a link condom. Finally, rel-nofollow requires crawling too.
URL removal tool in GWC (Matt’s second recommendation)
Taking Matt’s enthusiasm while talking about Google’s neat URL terminator into account, this one should be considered his first recommendation. Google has provided tools to remove URLs from their search index for at least five years (way longer IIRC). Recently the Webmaster Central team has integrated those, as well as new functionality, into the Webmaster Console, giving it a very nice UI. The URL removal tools come with great granularity, and because the user’s site ownership is verified, it’s pretty powerful, safe, and even shows the progress for each request (the removal process lasts a few days). Its UI is very flexible and even allows revoking of previous removal requests. The wonderful little tool’s sole weak point is that it can’t remove URLs from the search index forever. After 90 days or possibly six months the erased stuff can pop up again.
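As promised in the robots meta tag item above, here’s a sketch of how X-Robots-Tags could be applied to non-HTML resources via .htaccess; it assumes mod_headers is available, and the file pattern is just an example:
# send indexer directives in the HTTP header for PDF files (requires mod_headers)
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>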

Summary: If your site isn’t password protected, and you can’t live with indexing of disallow’ed contents, you must remove unwanted URLs from Google’s search index periodically. However, there are additional procedures that can support -but not guarantee!- deindexing. With other search engines it’s even worse, because those don’t respect the REP like Google, and don’t provide such handy URL removal tools.

The definitive way

Actually, I think Matt’s advice is very good. As long as you don’t need a permanent solution, and if you lack the programming skills to develop such a beast that works with all (major) search engines. I mean everybody can insert a robots meta tag or robots.txt statement, and everybody can semiyearly repeat URL removal requests with the neat URL terminator, but most folks are scared when it comes to conditional manipulation of HTTP headers to prevent stuff from indexing. However, I’ll try to explain quite safe methods that actually work (with Apache, not IIS) in the following examples.

First of all, if you really want search engines not to index your stuff, you must allow them to crawl it. And no, that’s not an oxymoron. At the moment there’s no such thing as an indexer directive on site-level. You can’t forbid indexing in robots.txt. All indexer directives require crawling of the URLs that you want to keep out of the SERPs. Of course that doesn’t mean you should serve search engine crawlers a book from each forbidden URL.

Let’s start with robots.txt. You put
User-agent: *
Disallow: /images/
Disallow: /movies/
Disallow: /unsearchable/
 
User-agent: Googlebot
Disallow:
Allow: /
 
User-agent: Slurp
Disallow:
Allow: /

The first section is just a fallback.

(Here comes a rather brutal method that you can use to keep search engines out of particular directories. It’s not suitable to deal with duplicate content, session IDs, or other URL canonicalization. More on that later.)

Next edit your .htaccess file.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/unsearchable/
RewriteCond %{REQUEST_URI} !\.php
RewriteRule . /unsearchable/output-content.php [L]
</IfModule>

If you’ve .php pages in /unsearchable/ then remove the second rewrite condition, put output-content.php into another directory, and edit my PHP code below so that it includes the PHP scripts (don’t forget to pass the query string).

Now grab the PHP code to check for search engine crawlers here and include it below. Your script /unsearchable/output-content.php looks like:
<?php
@include("crawler-stuff.php"); // defines variables and functions
$isSpider = checkCrawlerIP($requestUri);
if ($isSpider) {
    @header("HTTP/1.1 403 Thou shalt not index this", TRUE, 403);
    @header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
    exit;
}
 
// map the requested URI to a file below the document root
$arr = explode("#", $requestUri);
$outputFileName = $arr[0];
$arr = explode("?", $outputFileName);
$outputFileName = $_SERVER["DOCUMENT_ROOT"] . $arr[0];
if (substr($outputFileName, -1, 1) == "/") {
    $outputFileName .= "index.html";
}
if (file_exists($outputFileName)) {
    // send the content type header
    $contentType = "text/plain";
    if (stristr($outputFileName, ".html")) $contentType = "text/html";
    if (stristr($outputFileName, ".css")) $contentType = "text/css";
    if (stristr($outputFileName, ".js")) $contentType = "text/javascript";
    if (stristr($outputFileName, ".png")) $contentType = "image/png";
    if (stristr($outputFileName, ".jpg")) $contentType = "image/jpeg";
    if (stristr($outputFileName, ".gif")) $contentType = "image/gif";
    if (stristr($outputFileName, ".xml")) $contentType = "application/xml";
    if (stristr($outputFileName, ".pdf")) $contentType = "application/pdf";
    @header("Content-type: $contentType");
    @header("X-Robots-Tag: noindex,noarchive,nosnippet,noodp,noydir");
    readfile($outputFileName);
    exit;
}
 
// That’s not the canonical way to call the 404 error page. Don’t copy, adapt:
@header("HTTP/1.1 307 Oups, I displaced $outputFileName", TRUE, 307);
@header("Location: http://sebastians-pamphlets.com/404/");
exit;
?>

What does the gibberish above do? In .htaccess we rewrite all requests for resources stored in /unsearchable/ to a PHP script, which checks whether the request is from a search engine crawler or not.

If the requestor is a verified crawler (known IP or IP and host name belong to a major search engine’s crawling engine), we return an unfriendly X-Robots-Tag and an HTTP response code 403 telling the search engine that access to our content is forbidden. The search engines should assume that a human visitor receives the same response, hence they aren’t keen on indexing these URLs. Even if a search engine lists an URL on the SERPs by accident, it can’t tell the searcher anything about the uncrawled contents. That’s unlikely to happen actually, because the X-Robots-Tag forbids indexing (Ask and MSN might ignore these directives).

If the requestor is a human visitor, or an unknown Web robot, we serve the requested contents. If the file doesn’t exist, we call the 404 handler.

With dynamic content you must handle the query string and (expected) cookies yourself. PHP’s readfile() is binary safe, so the script above works with images or PDF documents too.

If you’ve an original search engine crawler coming from a verifiable server feel free to test it with this page (user agent spoofing doesn’t qualify as crawler, come back in a week or so to check whether the engines have indexed the unsearchable stuff linked above).

The method above is not only brutal, it wastes all the juice from links pointing to the unsearchable site areas. To rescue the PageRank, change the script as follows:

$urlThatDesperatelyNeedsPageRank = "http://sebastians-pamphlets.com/about/";
if ($isSpider) {
    @header("HTTP/1.1 301 Moved permanently", TRUE, 301);
    @header("Location: $urlThatDesperatelyNeedsPageRank");
    exit;
}

This redirects crawlers to the URL that has won your internal PageRank lottery. Search engines will/shall transfer the reputation gained from inbound links to this page. Of course page by page redirects would be your first choice, but when you block entire directories you can’t accomplish this kind of granularity.

By the way, when you remove the offensive 403-forbidden stuff in the script above and change it a little more, you can use it to apply various X-Robots-Tags to your HTML pages, images, videos and whatnot. When a search engine finds an X-Robots-Tag in the HTTP header, it should ignore conflicting indexer directives in robots meta tags. That’s a smart way to steer indexing of bazillions of resources without editing them.

Ok, this was the cruel method; now let’s discuss cases where telling crawlers how to behave is a royal PITA, thanks to the lack of indexer directives in robots.txt that provide the required granularity (Truncate-variable, Truncate-value, Order-arguments, …).

Say you’ve session IDs in your URLs. That’s one (not exactly elegant) way to track users or affiliate IDs, but strictly forbidden when the requestor is a search engine’s Web robot.

In fact, a site with unprotected tracking variables is a spider trap that would produce infinite loops in crawling, because spiders following internal links with those variables discover new redundant URLs with each and every fetch of a page. Of course the engines found suitable procedures to dramatically reduce their crawling of such sites, which results in fewer indexed pages. Besides joyless index penetration there’s another disadvantage - the indexed URLs are powerless duplicates that usually rank beyond the sonic barrier at 1,000 results per search query.

Smart search engines perform highly sophisticated URL canonicalization to get a grip on such crap, but Webmasters can’t rely on Google & Co to fix their site’s maladies.

Ok, we agree that you don’t want search engines to index your ugly URLs, duplicates, and whatnot. To properly steer indexing, you can’t just block the crawlers’ access to URLs/contents that shouldn’t appear on SERPs. Search engines discover most of those URLs when following links, and that means that they’re ready to assign PageRank or other scoring of link popularity to your URLs. PageRank / linkpop is a ranking factor you shouldn’t waste. Every URL known to search engines is an asset, hence handle it with care. Always bother to figure out the canonical URL, then do a page by page permanent redirect (301).

For your URL canonicalization you should have an include file that’s available at the very top of all your scripts, executed before PHP sends anything to the user agent (don’t hack each script, maintaining so many places handling the same stuff is a nightmare, and fault-prone). In this include file put the crawler detection code and your individual routines that handle canonicalization and other search engine friendly cloaking routines.

View a Code example (stripping useless query string variables).
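In case that example ever goes away, here’s a stripped-down sketch of the idea; the variable names in $uselessVars are made up, adjust them to whatever your site appends to its URLs:
<?php
// crude canonicalization sketch: drop tracking variables and 301 to the clean URL
$uselessVars = array("sessionID", "affID", "ref"); // made-up names, adjust to your site
$url = parse_url("http://" . $_SERVER["SERVER_NAME"] . $_SERVER["REQUEST_URI"]);
$queryVars = array();
if (!empty($url["query"])) {
    parse_str($url["query"], $queryVars);
}
$cleanVars = $queryVars;
foreach ($uselessVars as $uselessVar) {
    unset($cleanVars[$uselessVar]);
}
if ($cleanVars != $queryVars) {
    $canonicalUrl = "http://" . $_SERVER["SERVER_NAME"] . $url["path"];
    if (count($cleanVars)) {
        $canonicalUrl .= "?" . http_build_query($cleanVars);
    }
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit;
}
?>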

How you implement the actual canonicalization routines depends on your individual site. I mean, if you didn’t have the coding skills necessary to accomplish that, you wouldn’t have read this entire section, would you?

    Here are a few examples of pretty common canonicalization issues:

  • Session IDs and other stuff used for user tracking
  • Affiliate IDs and IDs used to track the referring traffic source
  • Empty values of query string variables
  • Query string arguments put in different order / not checking the canonical sequence of query string arguments (ordering them alphabetically is always a good idea; see the sketch after this list)
  • Redundant query string arguments
  • URLs longer than 255 bytes
  • Server name confusion, e.g. subdomains like “www”, “ww”, “random-string” all serving identical contents from example.com
  • Case issues (IIS/clueless code monkeys handling GET-variables/values case-insensitive)
  • Spaces, punctuation, or other special characters in URLs
  • Different scripts outputting identical contents
  • Flawed navigation, e.g. passing the menu item to the linked URL
  • Inconsistent default values for variables expected from cookies
  • Accepting undefined query string variables from GET requests
  • Contentless pages, e.g. outputted templates when the content pulled from the database equals whitespace or is not available
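As for the canonical order of query string arguments mentioned in the list above, a tiny sketch along these lines could enforce it; it handles scalar values only and assumes it runs before any output is sent:
<?php
// sketch: drop empty values, order query string arguments alphabetically,
// and 301 to the canonical URL if the incoming query string differs
$queryVars = array();
parse_str($_SERVER["QUERY_STRING"], $queryVars);
$queryVars = array_filter($queryVars, "strlen"); // drop empty values (scalar values only)
ksort($queryVars);                               // alphabetical argument order
$canonicalQuery = http_build_query($queryVars);
if ($canonicalQuery != $_SERVER["QUERY_STRING"]) {
    $arr = explode("?", $_SERVER["REQUEST_URI"]);
    $canonicalUrl = "http://" . $_SERVER["SERVER_NAME"] . $arr[0];
    if ($canonicalQuery) {
        $canonicalUrl .= "?" . $canonicalQuery;
    }
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit;
}
?>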

Summary

Hiding contents from all search engines requires programming skills that many sites can’t afford. Even leading search engines like Google don’t provide simple and suitable ways to deindex content -or to prevent content from being indexed- without collateral damage (lost/wasted PageRank). We desperately need better tools. Maybe my robots.txt extensions are worth an inspection.



