Why storing URLs with truncated trailing slashes is utter idiocy

Yahoo steals my trailing slashes

With some Web services, URL canonicalization has a downside. What works great for major search engines like Google can backfire when a Web service like Yahoo thinks circumcising URLs is cool. Proper URL canonicalization might, for example, screw your blog’s reputation at Technorati.

In fact the problem is not your URL canonicalization (e.g. 301 redirects from http://example.com to http://example.com/, or from http://example.com/directory to http://example.com/directory/), but crappy software that removes trailing forward slashes from your URLs.

Dear Web developers, if you really think that home page locations or directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “http://example.com” as anchor text makes any sense when the URL in HREF points to “http://example.com/”). Spreading invalid URLs is not funny. People as well as Web robots take invalid URLs from your pages for various purposes, and many of those usages are capable of damaging the search engine rankings of the link destinations. You can’t control that, hence don’t screw our URLs. Never. Period.

Folks who don’t agree with the above should read on.

    TOC:

  • What is a trailing slash? About URLs, directory URIs, default documents, directory indexes, …
  • How to rescue stolen trailing slashes: About Apache’s handling of directory requests, and rewriting or redirecting invalid directory URIs in .htaccess as well as in PHP scripts.
  • Why stealing trailing slashes is not cool: Truncating slashes is not only plain robbery (bandwidth theft), it often causes malfunctions at the destination server and 3rd party services as well.
  • How URL canonicalization irritates Technorati: 301 redirects that “add” a trailing slash to directory URLs, or virtual URIs that mimic directories, seem to irritate Technorati so much that it can’t compute reputation, recent post lists, and so on.

What is a trailing slash?

The Web’s standards say (links and full quotes): The trailing path segment delimiter “/” represents an empty last path segment. Normalization should not remove delimiters when their associated component is empty. (Read the polite “should” as “must”.)

To understand that, let’s look at the most common URL components:
scheme:// server-name.tld /path ?query-string #fragment
The path part begins with a forward slash “/” and must consist of at least one byte (the trailing slash itself in the case of the home page URL http://example.com/).

If an URL ends with a slash, it points to a directory’s default document, or, if there’s no default document, to a list of objects stored in a directory. The home page link lacks a directory name, because “/” after the TLD (.com|net|org|…) stands for the root directory.
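How a parser sees these URLs can be demonstrated with any standards-compliant URL library. A quick sketch (Python here for illustration; the rest of this post uses PHP):

```python
from urllib.parse import urlsplit

# "http://example.com" has an empty path component, while the canonical
# home page URL "http://example.com/" has the root path "/" - to a
# parser these are two different URLs.
print(urlsplit("http://example.com").path)       # empty string
print(urlsplit("http://example.com/").path)      # "/"
print(urlsplit("http://example.com/dir/").path)  # "/dir/"
```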

Automated directory indexes (a list of links to all files) should be forbidden; use Options -Indexes in .htaccess to send such requests to your 403-Forbidden page.

In order to set default file names and their search sequence for your directories, use DirectoryIndex index.html index.htm index.php /error_handler/missing_directory_index_doc.php. In this example: on a request of http://example.com/directory/ Apache will first look for /directory/index.html, then, if that doesn’t exist, for /directory/index.htm, then /directory/index.php, and if all that fails it will serve an error page (which should log such requests so that the Webmaster can upload the missing default document to /directory/).
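Apache’s lookup sequence can be mimicked in a few lines (a sketch in Python rather than PHP; the function and path names are made up for illustration):

```python
import os

def resolve_directory_index(doc_root, dir_path, candidates):
    """Return the first candidate default document that exists in the
    requested directory (mimics Apache's DirectoryIndex), or None,
    which would hand the request to the error handler."""
    for name in candidates:
        candidate = os.path.join(doc_root, dir_path.strip("/"), name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

With candidates ["index.html", "index.htm", "index.php"], a directory containing only index.htm resolves to that file, exactly as described above.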

The URL http://example.com (without the trailing slash) is invalid, and there’s no specification giving a reason why a Web server should respond to it with meaningful contents. Actually, the location http://example.com points to Null (nil, zilch, nada, zip, nothing), hence the correct response is “404 - we haven’t got ‘nothing to serve’ yet”.

The same goes for sub-directories. If there’s no file named “/dir”, the URL http://example.com/dir points to Null too. If you’ve a directory named “/dir”, the canonical URL http://example.com/dir/ either points to a directory index page (an autogenerated list of all files) or the directory’s default document “index.(html|htm|shtml|php|…)”. A request of http://example.com/dir –without the trailing slash that tells the Web server that the request is for a directory’s index– resolves to “not found”.

You must not reference a default document by its name! If you’ve links like http://example.com/index.html you can’t change the underlying technology without serious hassles. Say you’ve a static site with a file structure like /index.html, /contact/index.html, /about/index.html and so on. Tomorrow you’ll realize that static stuff sucks, hence you’ll develop a dynamic site with PHP. You’ll end up with new files: /index.php, /contact/index.php, /about/index.php and so on. If you’ve coded your internal links as http://example.com/contact/ etc. they’ll still work, without redirects from .html to .php. Just change the DirectoryIndex directive from “… index.html … index.php …” to “… index.php … index.html …”. (Of course you can configure Apache to parse .html files for PHP code, but that’s another story.)
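The migration described above boils down to reordering one directive (a before/after sketch, using the file names from the example):

```
# static era: HTML documents first
DirectoryIndex index.html index.htm index.php
# after the move to PHP: PHP first, old HTML files as fallback
DirectoryIndex index.php index.html index.htm
```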

It seems that truncating default document names can make sense for services that deal with URLs, but watch out for sites that serve different contents under various extensions of “index” files (intentionally or not). I’d say that folks submitting their ugly index.html files to directories, search engines, top lists and whatnot deserve all the hassles that come with later changes.

How to rescue stolen trailing slashes

Since Web servers know that users are faulty by design, they jump through a couple of resource-burning hoops in order to either add the trailing slash so that relative references inside HTML documents (CSS/JS/feed links, image locations, HREF values …) work correctly, or apply voodoo to accomplish that without (visibly) changing the address bar.

With Apache, DirectorySlash On enables this behavior (check whether your Apache version does 301 or 302 redirects, in case of 302s find another solution). You can also rewrite invalid requests in .htaccess when you need special rules:
RewriteEngine on
RewriteBase /content/
RewriteRule ^dir1$ http://example.com/content/dir1/ [R=301,L]
RewriteRule ^dir2$ http://example.com/content/dir2/ [R=301,L]
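If many directories need this treatment, a single catch-all rule can replace the per-directory rules (a sketch only: it assumes the .htaccess sits in the document root, and DirectorySlash On achieves much the same; test it against your setup first):

```
RewriteEngine on
# only touch requests that map to an existing directory ...
RewriteCond %{REQUEST_FILENAME} -d
# ... and that don't already end with a slash
RewriteRule ^(.*[^/])$ http://example.com/$1/ [R=301,L]
```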

With content management systems (CMS) that generate virtual URLs on the fly, often there’s no choice but to hack the software to canonicalize invalid requests. To prevent search engines from indexing invalid URLs that are in fact duplicates of canonical URLs, perform permanent (301) redirects.

Here is a WordPress (header.php) example:
$requestUri = $_SERVER["REQUEST_URI"];
$queryString = $_SERVER["QUERY_STRING"];
$doRedirect = FALSE;
$fileExtensions = array(".html", ".htm", ".php");
$serverName = $_SERVER["SERVER_NAME"];
$canonicalServerName = $serverName;

// if you prefer http://example.com/* URLs remove the "www.":
// (note: this naive approach breaks multi-part TLDs like .co.uk;
// hard-code $canonicalServerName for such domains)
$srvArr = explode(".", $serverName);
$canonicalServerName = $srvArr[count($srvArr) - 2] . "." . $srvArr[count($srvArr) - 1];

$url = parse_url("http://" . $canonicalServerName . $requestUri);
$requestUriPath = $url["path"];
if (substr($requestUriPath, -1, 1) != "/") {
    // don't append a slash to URIs that look like file names
    $isFile = FALSE;
    foreach ($fileExtensions as $fileExtension) {
        if (strtolower(substr($requestUriPath, strlen($fileExtension) * -1, strlen($fileExtension))) == strtolower($fileExtension)) {
            $isFile = TRUE;
        }
    }
    if (!$isFile) {
        $requestUriPath .= "/";
        $doRedirect = TRUE;
    }
}
$canonicalUrl = "http://" . $canonicalServerName . $requestUriPath;
if ($queryString) {
    $canonicalUrl .= "?" . $queryString;
}
// REQUEST_URI never carries a fragment, so this is just a safety net:
if (isset($url["fragment"])) {
    $canonicalUrl .= "#" . $url["fragment"];
}
if ($doRedirect) {
    @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    @header("Location: $canonicalUrl");
    exit;
}

Check your permalink settings and edit the values of $fileExtensions and $canonicalServerName accordingly. For other CMSs adapt the code; perhaps you need to change the handling of query strings and fragments. The code above will not run under IIS, because IIS has no REQUEST_URI variable.
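For a CMS that isn’t written in PHP, the same logic ports easily. Here is the canonicalization step sketched in Python (the function name and extension list are mine, mirroring the PHP code above):

```python
from urllib.parse import urlsplit

FILE_EXTENSIONS = (".html", ".htm", ".php")

def canonicalize(url, extensions=FILE_EXTENSIONS):
    """Append the trailing slash to directory-like paths; return the
    canonical URL and whether a 301 redirect is required."""
    parts = urlsplit(url)
    path = parts.path
    needs_redirect = False
    if not path:
        # empty path: the invalid http://example.com case
        path = "/"
        needs_redirect = True
    elif not path.endswith("/"):
        lowered = path.lower()
        if not any(lowered.endswith(ext) for ext in extensions):
            # neither a directory URI nor a known file name
            path += "/"
            needs_redirect = True
    canonical = "%s://%s%s" % (parts.scheme, parts.netloc, path)
    if parts.query:
        canonical += "?" + parts.query
    return canonical, needs_redirect
```

For example, canonicalize("http://example.com/about") returns ("http://example.com/about/", True), while file-like URIs such as /index.html pass through unchanged.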

Why stealing trailing slashes is not cool

This section expressed in one sentence: Cool URLs don’t change, hence changing other people’s URLs is not cool.

Folks should understand the “U” in URL as unique. Each URL addresses one and only one particular resource. Technically speaking, if you change one single character of an URL, the altered URL points to a different resource, or nowhere.

Think of URLs as phone numbers. When you call 555-0100 you reach the switchboard, 555-0101 is the fax, and 555-0109 is the phone extension of somebody. When you steal the last digit, dialing 555-010, you get nowhere.

Yahoo'ish fools steal our trailing slashes

Only a fool would assert that a phone number shortened by one digit is way cooler than the complete phone number that actually connects somewhere. Well, the last digit of a phone number and the trailing slash of a directory link aren’t much different. If somebody hands out an URL (with trailing slash), then use it as is, or don’t use it at all. Don’t “prettify” it, because any change destroys its serviceability.

If one requests a directory without the trailing slash, most Web servers will just reply to the user agent (browser, screen reader, bot) with a redirect header telling it that it must use a trailing slash; the user agent then has to re-issue the request in the formally correct way. From a Webmaster’s perspective, burning resources that thoughtlessly is plain theft. From a user’s perspective, things will often work without the slash, but they’ll be quicker with it. “Often” doesn’t equal “always”:

  • Some Web servers will serve the 404 page.
  • Some Web servers will serve the wrong content, because /dir is a valid script, virtual URI, or page that has nothing to do with the index of /dir/.
  • Many Web servers will respond with a 302 HTTP response code (Found) instead of a correct 301-redirect, so that most search engines discovering the sneakily circumcised URL will index the contents of the canonical URL under the invalid URL. Now all search engine users will request the incomplete URL too, running into unnecessary redirects.
  • Some Web servers will serve identical contents for /dir and /dir/, which leads to duplicate content issues with search engines that index both URLs from links. Most Web services that rank URLs will assign different scorings to each known URL variant, instead of accumulating rankings across both URLs (which would be the right thing to do, but is technically, well, challenging).
  • Some user agents can’t handle (301) redirects properly. Exotic user agents might serve the user an empty page or the redirect’s “error message”, and Web robots like the crawlers sent out by Technorati or MSN-LiveSearch hang up or process garbage.

Does it really make sense to maliciously manipulate URLs just because some clueless developers say “dude, without the slash it looks way cooler”? Nope. Stealing trailing slashes in general as well as storing amputated URLs is a brain dead approach.

KISS (keep it simple, stupid) is a great principle. “Cosmetic corrections” like trimming URLs add unnecessary complexity that leads to erroneous behavior and requires even more code tweaks. GIGO (garbage in, garbage out) is another great principle that applies here. Smart algos don’t change their inputs. As long as the input is processible, they accept it, otherwise they skip it.

Exceptions

URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - “domain.co.uk” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than a visitor who doesn’t type in the canonical URL including scheme, www-prefix, and trailing slash.

How URL canonicalization seems to irritate Technorati

Due to the not exactly responsive (or simply swamped) Technorati user support, parts of this section should be interpreted as educated speculation. Also, I didn’t research enough cases to come to a working theory. So here is just the story of “how Technorati fails to deal with my blog”.

When I moved my blog from blogspot to this domain, I enhanced the faulty WordPress URL canonicalization. If any user agent requests http://sebastians-pamphlets.com it gets redirected to http://sebastians-pamphlets.com/. Invalid post/page URLs like http://sebastians-pamphlets.com/about redirect to http://sebastians-pamphlets.com/about/. All redirects are permanent, returning the HTTP response code “301”.

I’ve claimed my blog as http://sebastians-pamphlets.com/, but Technorati shows its URL without the trailing slash.
…<div class="url"><a href="http://sebastians-pamphlets.com">http://sebastians-pamphlets.com</a> </div> <a class="image-link" href="/blogs/sebastians-pamphlets.com"><img …

By the way, they forgot dozens of fans (folks who “fave’d” either my old blogspot outlet or this site) too.
Blogs claimed at Technorati

I’ve added a description and tons of tags, but neither shows up on public pages. It seems my tags were deleted; at least they aren’t visible in edit mode any more.
Edit blog settings at Technorati

Shortly after the submission, Technorati stopped adjusting the reputation score from newly discovered inbound links. Furthermore, the list of my recent posts became stale, although I’ve pinged Technorati with every update, and Technorati received my update notifications via ping services too. And yes, I’ve tried manual pings to no avail.

I’ve gained lots of fresh inbound links, but the authority score didn’t change. So I asked Technorati’s support for help. A few weeks later, in December 2007, I got an answer:

I’ve taken a look at the issue regarding picking up your pings for “sebastians-pamphlets.com”. After making a small adjustment, I’ve sent our spiders to revisit your page and your blog should be indexed successfully from now on.

Please let us know if you experience any problems in the future. Do not hesitate to contact us if you have any other questions.

Indeed, Technorati updated the reputation score from “56” to “191”, and refreshed the list of posts including the most recent one.

Of course the “small adjustment” didn’t persist (I assume that a batch process stole the trailing slash that the friendly support person had added). I’ve sent a follow-up email asking whether that’s a slash issue or not, but haven’t received a reply yet. I’m quite sure that Technorati doesn’t follow 301-redirects, so that’s a plausible cause for this bug at least.

Since December 2007 Technorati hasn’t updated my authority score (just the rank goes up and down depending on the number of inbound links Technorati shows on the reactions page - by the way, these numbers are often unreal and change in the range of hundreds from day to day).
Blog reactions and authority scoring at Technorati

It seems Technorati hasn’t indexed my posts since then (December 18, 2007), so probably my outgoing links don’t count for their destinations.
Stale list of recent posts at Technorati

(All screenshots were taken on February 5, 2008. When you click the Technorati links today, they will hopefully look different.)

I’m not amused. I’m curious what would happen when I add
if (!preg_match("/Technorati/i", $_SERVER["HTTP_USER_AGENT"])) { /* redirect code */ }

to my canonicalization routine, but I can resist special-casing particular Web robots. My URL canonicalization should be identical for visitors and crawlers alike. Technorati should be able to fix this bug without code changes at my end or weekly support requests. Wishful thinking? Maybe.

Update 2008-03-06: Technorati crawls my blog again. The 301 redirects weren’t the issue. I’ll explain that in a follow-up post soon.




52 Comments to "Why storing URLs with truncated trailing slashes is utter idiocy"

  1. Wayne Smallman on 6 February, 2008  #link

    This is something of a pet peeve of mine, too.

    I share your angst…

  2. Rich on 6 February, 2008  #link

    Errrmmmm…… I totally don’t get your point… at all!?!

    You receive the traffic, so why can’t you take control of it?

    I feel that either you or I are missing some thing here…

    - rich

  3. Jason on 6 February, 2008  #link

    As always, great post Sebastian… Good work. I’m a big fan of your blog, thanks for posting!

  4. Derek on 6 February, 2008  #link

    Really nice breakdown here and something to bookmark for future reference (don’t know how many times I’ve been on the same side of this discussion).

    Thanks for such a detailed post.

  5. Sebastian on 6 February, 2008  #link

    Thanks guys! :)
    Rich, the big deal is not the traffic that I receive at crappy URLs and redirect to my contents. My problem is the traffic I’ll never receive because a clueless assclown 2.0 screws my URLs.

    Jason, thanks for your recommendation of my Technorati mass ping tool.

  6. Andy Beard on 6 February, 2008  #link

    For some reason I don’t have any canonical problems at all with Technorati.
    They seem to pick up with or without / as the same domain, and even www is treated the same - that gets used quite often by people.
    The link from my blog profile with them is with a /

    I just hope that they never fully support 301 redirects, or if they do, they also stop counting sidebar links at the same time. Using a different domain and 301s is currently the easiest way to avoid being banned from their top100 and search results for gaming their system.

  7. SearchCap: The Day In Search, February 6, 2008…

    Below is what happened in search today, as reported on Search Engine Land and from other places across the web…….

  8. […] Why storing URLs with truncated trailing slashes is an utterly idiocy — “Dear Web developers, if you really think that home page locations respectively directory URLs look way cooler without the trailing slash, then by all means manipulate the anchor text, but do not manipulate HREF values, and do not store truncated URLs in your databases (not that “http://example.com” as anchor text makes any sense when the URL in HREF points to “http://example.com/”).” […]

    [That’s a nice list of resources, descriptive linking to tons of articles that discuss URLs and related topics. Sebastian]

  9. Sebastian on 6 February, 2008  #link

    Lucky you, Andy. Today I’ve sent them 3 pings (one from WP with this post, one manual ping at the Technorati site, and one from my ping tool), and they still claim that they’ve received the last ping 48 days ago. No wonder that they don’t index me. Besides the trailing slash issue they might be confused by two blogs with the same name (this site and my outdated blogspot thingy). Both possible problems, if issues at all(!), would indicate serious architectural problems or misconceptions of design principles. I may be totally wrong, though. I’m still waiting for an explanation.

    They really should fully support 301 redirects. There are so many bloggers who use canonicalization plugins, or other methods that all 301-redirect from invalid URLs to canonical URLs. It’s really not up to Technorati, or any other search engine for that matter, to decide which URL is the right one. That’s the blogger’s decision, and the sole method to announce it is the 301 redirect.

    As for the sidebar links, that’s not a big deal IMO. Many blogs link to comment authors in their sidebar, and Technorati rightly counts those links like any other link on the main page. Why shouldn’t they? When a blogger decides to highlight comment authors, that’s –at the time of crawling– a vote as good as a persistent blogroll link. The same goes for widgets that for example create server-sided links from shared GoogleReader feeds and stuff like that. Ignoring sidebar ads and nofollow’ed links shouldn’t be too complicated.

    I see another problem. More and more blogs serve their blogroll on a separate page, and Technorati ignores those. I mean it’s really not that hard to tell a crawler that it has to follow links with an anchor text of “blogroll” and a few variations like “bragroll” or so. Also, there’s always the OPML feed, which –if the blogger doesn’t maintain a hard coded blogroll– serves all blogroll links in a nicely structured format (XML), at least with WordPress blogs.

  10. […] Why storing URLs with truncated trailing slashes is an utterly idiocy Great summary on “invalid” URLs because of missing trailing slashes and why these URLs can damage page popularity. […]

  11. jack on 6 February, 2008  #link

    Great stuff Sebastian, thanks for sharing
    so much good info.

    I have one question though: does this also
    matter with the main directory, ie:
    www.example.com/ and www.example.com
    should this be redirected too?

  12. Rich on 7 February, 2008  #link

    > Rich, the big deal is not the traffic that I receive at crappy URLs and redirect to my contents. My problem is the traffic I’ll never receive because a clueless assclown 2.0 screws my URLs.

    O.K: I’ve thought this through, and I’ve done some testing. Now I get you… You’ve taught me something - so thank you for that!

    - rich

    PS: And yes, I hope assclown2.0 is listening!

  13. Sebastian on 7 February, 2008  #link

    Thanks, Jack. :) www.example.com or example.com are directory URLs too, hence they’re invalid without the trailing slash. Redirecting an empty path to “/” should be a good idea. Most probably your server does that already, and hopefully returning the 301 HTTP response code, not 302. You can test the response with server header checkers.

    Rich, my pleasure. And yes, assclown 2.0 better listens. ;)

  14. Allan Stewart on 7 February, 2008  #link

    Seb,

    How much of an issue is this. My blog used mod-rewrite to create URL’s which look like http://www.domain.com/directories/but/no/slash/at/the/end

    I can’t say that it’s ever given me any problems. Not that I was aware of, anyway. I have re-read your article and it would seem that I may have an issue on Technorati, but again I don’t appear to have.

    Can you give any more specific insight?

    Cheers

    Allan

  15. […] Why storing URLs with truncated trailing slashes is an utterly idiocy A bit tecchy, and not sure what I can do about incoming links, but I will work hard to ensure all of my outgoing links now end with a slash. Hat-tip to Dan York for the tweet (tags: guide programming reference webdesign url seo) […]

  16. Critic on 7 February, 2008  #link

    No, the “U” in URL is “Uniform”. I quit reading after that flub.

  17. Sebastian on 7 February, 2008  #link

    Critic, when you read Folks should understand the “U” in URL as unique as “the ‘U’ in URL stands for ‘unique’”, I’m sorry. Of course the “U” in URL is “Uniform”. What I meant is URLs are unique pointers to particular resources, that is one URL points to one and only one resource, and there shouldn’t be more than one URL per resource.

  18. Sebastian on 7 February, 2008  #link

    Allan, that’s not an issue. When you choose a permalink structure without trailing slashes, that’s fine (you could also add a constant to “fake” .html pages). If you understand your virtual URLs as pseudo directories, add the trailing slash. If you understand them as pseudo files, then don’t add a slash. Just make sure that your blog software doesn’t serve content under URLs with trailing slash then, or 301-redirect those to the canonical URLs without trailing slash. I went for the version with trailing slash, because “files” without an extension look pretty unusual for DOS/WIN users. Also, I’ve some real directories on this domain, and thought that a consistent naming convention makes sense. A visitor doesn’t know whether a particular URL points to a static page, or a script output generated by a CMS that has no persistent file corresponding to the path string in the browser’s address bar on the server’s hard disk. Visitors just like meaningful URLs.

    The point is not that all virtual URLs without a file extension must end with a slash. The point is that regardless what you use, you must not allow others to remove anything from your URLs, respectively add anything to your URLs.

    If your CMS creates files like /dir/index.php that your Web server delivers to requestors in the same way as pages or scripts you’ve uploaded per FTP, that’s a completely different story. Then the canonical URL is /dir/ and you must redirect /dir to /dir/.

    Did the above answer your question?

  19. Melanie Phung on 8 February, 2008  #link

    Another great pamphlet Sebastian. Thanks. I’ve tried to have conversations like this a few times and never really managed to get my point across effectively. Maybe I’ll point people to this from now on (although I doubt the folks I’m talking to would understand this)… or I’ll wave my hands in the air and ask them to help me dispel the “voodoo”.

    p.s. why do you even give a crap about Technorati?

  20. Sebastian on 8 February, 2008  #link

    Thanks Melanie. :) I so hoped that this pamphlet would be easy to understand, not only for the engineers in the audience. Well, since that obviously didn’t work out, just point ignorant folks to my pamphlet telling them that if they sneakily manipulate URIs a very grumpy red crab will come all over their SERPs greedily gobbling their top positions.

    As for Technorati, it used to be a nice tool. Not in terms of traffic, but link discovery and such. Also, monitoring a blog’s reputation at Technorati gives you a rough idea of what real search engines might discover and how much weight your fresh inbound links from the blogosphere might have.

  21. Utah SEO Pro on 8 February, 2008  #link

    Sebastian, I’ve been wonderin how to fix that for a long time. You rock..thanks :)

  22. Pocket SEO on 11 February, 2008  #link

    Great article…
    Not sure if you mentioned it in the other post, but MSN/Live is also removing trailing slashes on display (but linking to the trailing slashes).

  23. Web Designer Group on 13 February, 2008  #link

    How are truncated URLs useful after storing them in a database?

  24. Sebastian on 13 February, 2008  #link

    Truncated URLs aren’t useful at all, not even as anchor text. Only idiots store them in databases.

  25. Sebastian on 4 March, 2008  #link

    Another support ticket was overdue:

    2008-03-04
    Howdy, although I ping you with every update, you don’t change my blog’s authority/rank, and you don’t index my posts. A while ago I’ve sent you another ticket and a very friendly support person made “a small tweak” to resolve the issue. Well, you’ve refreshed everything, but it’s stale again since 75 days. Do you have problems with the 301 redirect that prevents you from accessing my blog without the canonical trailing slash? Do I ping too much (WordPress seems to ping you even on changes of posts)? I don’t think that pinging is an issue, because it would be outright unfair to penalize a blogger for the default behavior of popular software like WP. So what’s going on? If I can change anything here, please tell me. For a detailed description of the problem please refer to this post:
    http://sebastians-pamphlets.com/thou-must-not-steal-the-trailing-slash-from-my-urls/#url-canonicalization-irritates-technorati
    Again, if I can help to solve the issue, please drop me a line.
    Have a nice day!
    Sebastian

    Let’s see what happens …

  26. Andy Beard on 4 March, 2008  #link

    You might have better luck pinging a Technorati guy directly… want some help on that?

  27. Sebastian on 4 March, 2008  #link

    Thanks Andy, I’ve emailed Ian Kallen and posted a message in the support forum. I’d very much appreciate your help. :) TIA!

  28. Andy Beard on 4 March, 2008  #link

    I was thinking of an email to Ian as well.

    You know he is also on Twitter

  29. Sebastian on 4 March, 2008  #link

    Thanks Andy, I’m following him now :)

  30. Tristan on 5 March, 2008  #link

    I love how in your “exception” section you basically invalidate your entire post.

    “URLs in print, radio, and offline in general, should be truncated in a way that browsers can figure out the location - “domain.co.uk” in print and “domain dot co dot uk” on radio is enough. The necessary redirect is cheaper than a visitor who doesn’t type in the canonical URL including scheme, www-prefix, and trailing slash.”

    So, for users, and in any case where a user has to actually *see* a URL (which is in many, many cases) the consensus is that the redirect is okay. Well then, fine.

    Is your stance then that the Internet is for human beings, or that it is for machines? If it’s only for machines then the machines creating the internet better canonically add a trailing slash, but if it’s for humans too, then we can’t trust anyone to slash or not (in fact we can barely trust them to spell the URL right in the first place) and should have measures to get them to the right place from a usability perspective.

    The problem: it’s for both humans and machines. Humans are the lowest common denominator (aka: the stupidest and the hardest to control) so my vote is that the machines do whatever they can to yield to the needs of the humans.

    That’s all I have to say, otherwise it’s a good article, but basically can be summed up by, “web servers and web services that parse URLs and browsers are too picky, so we should cater to them.” I’d appreciate more of a recognition that the whole idea that “http://www.url.com/” is different from “http://www.url.com” is a silly meaningless machine-centric detail to begin with, and even though it’s not right, we have to deal with it anyway. That’s my perspective. Otherwise you’re completely correct. :)

  31. Sebastian on 5 March, 2008  #link

    Tristan, when a user has to see an URI (on the Interweb) and can click it, it should be the canonical URI. If a user has to read or hear an URI (elsewhere) and is supposed to remember it, it should be as short as possible. The machines populating today’s Web aren’t smart enough to deal with “offline inputs”.

  32. Tristan on 5 March, 2008  #link

    And my point being: they should be. Machines should be smart enough to deal with any reasonable input thrown at them.

    So I’m just trying to say, it’s not anyone’s fault that we have to work around these limitations, it’s just something we have to do. Your essay came across as confrontational toward people who don’t understand why URIs should have trailing slashes, when in fact there’s no logical reason for anyone to understand that, because it’s naturally illogical that the trailing slash matters at all — except to a machine.

    I understand it perfectly. I just don’t like being told it’s my (or any other web developer’s) fault, because it’s definitely not. It’s essentially a flaw in the system that we have to work around, and should be presented as such.

  33. Sebastian on 5 March, 2008  #link

    Tristan, when a user types in a sneakily shortened URI that’s fine. When a Web developer –who should know better– links to a malformed URI that’s evil. Yes, that’s really evil, because this developer is aware of the not yet perfect machines, and this developer knows that the majority of all machines, including important Web robots like search engine crawlers, do rely on compliance with Web standards. Just because we wish that all this should be easier, that doesn’t change the reality. Here and now Web developers just have to comply with the standards, whether they like it or not. Our compilers are not as fault-tolerant as Web servers; we understand that, so we check the syntax beforehand and don’t deliver code that doesn’t compile or produces runtime errors when interpreted. Why is that different from sticking to canonical URIs?

  34. Tristan on 5 March, 2008  #link

    It is, in fact, completely different because canonical URIs are a shared syntax between unpredictable human users and machines! That makes it the responsibility of the machine (and the programmers who manage them) to make URIs the most usable they can be for the humans. Humans come first, machines serve humans, it’s a general concept. The first step to that is accepting that humans will break the rules, because it’s a statistical certainty.

    They’re not “sneaky”, they’re just human, and we *cannot* enforce how they enter URIs. In fact, we don’t want to, because we want to make it as easy and simple as possible, and thus attempt to determine what they want even if they don’t follow perfect syntax. It is not their fault and it is not expected of them to be a computer.

    The problem, then, is that this system is used by both machines and users. That is the problem. The basic, root problem.

    Like I said before (and which you’re not listening to, stuck in your preachy argument), I understand and agree with everything you’re saying: this is the reality, and developers should definitely use the correct URI format.

    I’m simply discussing the whole background of the situation, and if you’re not willing to look at why the system is the way it is, and why developers might be accustomed to not caring about the slash, and why the machines care but the users don’t — then you’re just spewing rules for the sake of rules. I’m just trying to look at the bigger picture here.

    Let me just say again, I agree with you completely about the rules and the reality, I just disagree completely on the way you portray them.

    The rules are *not* there for a good reason; they’re there for the machines, ignoring completely the fact that both machines and humans use them and have different expectations and needs. Of course we should follow them as programmers and developers if we know that we should; that’s obvious. But you’re trying to solve a problem here (namely, that developers use the wrong syntax for URIs) without understanding the underlying issue (that developers are human users too, and are rightfully accustomed to lax ‘human’ URI syntax), and that’s irking me. Think about it.

    And one more time, I understand the rules, don’t try to explain them to me again. Look deeper.

  35. Sebastian on 5 March, 2008  #link

    Thanks Tristan for your appreciated thoughts. I totally get your point. I just can’t agree. Developers are the middlemen between humans and machines. If a developer fails to translate in either direction, that’s, well, a failure. And yes, the rules are there for a good reason. This good reason is that the Web cannot function without machines, whether they’re perfect or not. Machines understand proper syntax and semantics, and they often fail on guesswork. Just because in an ideal world machines would understand intentions and perhaps even read our thoughts, that does not mean that we can throw away the protocols today. As long as we don’t live in an ideal world, GIGO applies. Please reread my phone number example. Would you support the civilized human who refuses to make use of the phone number’s last digit too? Of course not, because that’s plain foolish as long as not every phone on this planet has an autocomplete helper that cannot fail.

    The “sneakily” was not meant seriously.

  36. Tristan on 5 March, 2008  #link

    That’s exactly right in today’s Internet, in the world we live in, in reality. That’s what you’re talking about.

    All I’m trying to do is discuss how we might improve the system to better support both humans and machines; right now a system designed only for machines does that with hacks (redirects) to accommodate natural human error.

    Your example with phone numbers doesn’t make sense anyway. Even in today’s world, the real analogy is as if there was an extra “0” at the end of each phone number (so, 555-01000, 555-01010, 555-01090), and it didn’t really matter if you included it or not (555-0100, 555-0101, 555-0109 still worked), it would still get you to the same place (because that’s how URIs *functionally* work today on the real internet). Would you still dial the extra 0 at the end just out of convention? No, of course not.

    Okay okay, I’m not arguing anymore. I understand and agree with you, I was just put off by your machine-centric viewpoint. I suppose it’s the individual site/server’s responsibility to support user URI errors such as missing slashes, and then to use correct URIs for all external links and resources.

    I just want to make sure you understand that, for users’ sake, a resource without a slash had darn well better mean the same thing as with the slash. If a domain without a slash gives a 404, then for all usability purposes it is simply broken. So that, too, is the responsibility of the site to accommodate its users, which is equal to or greater than its responsibility to other sites on the internet.

  37. Sebastian on 6 March, 2008  #link

    Tristan, the phone number example totally makes sense, and that’s the point you don’t get. The trailing slash is part of the address. When you remove it, in some cases you get different content, and in some cases you run into a redirect. A manipulated URI is like a different phone number. When you add a character to an URI, as you do with your additional digit, you will always run into a 404 Not Found error on the Web. Adding stuff to URIs is as bad as truncating them in hyperlinks. With the phone numbers of call centers your additional digit will be taken as input after the connection was established, so you can end up in a support queue instead of chatting with a sales rep. So the lesson is: don’t play with addresses of any kind. Frankly, neither stealing the trailing slash from URIs nor adding or removing digits from phone numbers is “natural human behavior”. It’s a plain stupid error that kids learn to avoid in elementary school, if they didn’t figure it out already by applying the trial-and-error method. In other words, in some cases folks just have to play by the rules, because they land nowhere otherwise - online as well as offline. I don’t see a way to improve the current error handling in both cases (invalid URI / phone#), and there’s absolutely no reason to encourage users to use incorrect or manipulated identifiers. BTW, a directory URI request without the trailing slash runs into a 404 internally before a well configured Web server recognizes that “the user is simply broken” and replies with a redirect header. “Machine-centric viewpoints” can save the human visitors time, and hassles. ;)
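    [Editor’s sketch: the canonicalization Sebastian describes can be illustrated in a few lines of Python. This mimics, in spirit only, what Apache’s mod_dir does with a 301 redirect; the “no dot in the last path segment means it’s a directory” rule is just a heuristic assumed for this sketch, not how any real server decides.]

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_directory_uri(url):
    """Append the trailing slash a directory URI needs.

    Heuristic sketch only: a last path segment without a dot is
    treated as a directory. Real servers check the filesystem or
    their URI-to-resource mapping instead.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    last_segment = path.rsplit("/", 1)[-1]
    if path == "" or (not path.endswith("/") and "." not in last_segment):
        path += "/"
    return urlunsplit((scheme, netloc, path, query, fragment))
```

    [So a link to http://example.com/directory would be canonicalized to http://example.com/directory/ before it gets stored or spread.]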

  38. Tristan on 6 March, 2008  #link

    I think our problem here is that we’re talking about different things. I’m talking about users typing in URIs and you’re talking about links and references created by developers or applications. From a user’s point of view when typing the URI the slash is meaningless (since the content is the same in 99% of cases) and that will never change, nor should it ever, since it’s in a server’s best interest to allow both. From a machine’s perspective, it’s best to have a single canonical URI. Thus, redirects for the users and slashes for the machines. Best of both worlds.

    So, we were on completely different tracks… sorry for the confusion, I thought it was clear I was talking about users.

    But if you go back to what you wrote here: “Tristan, when a user has to see an URI (on the Interweb) and can click it, it should be the canonical URI. If a user has to read or hear an URI (elsewhere) and is supposed to remember it, it should be as short as possible. The machines populating today’s Web aren’t smart enough to deal with “offline inputs”.”
    — That I agree with completely, so let’s leave it at that. I was just interested in discussing the deeper human-centric issues around URIs and their “correct” behavior, but apparently you aren’t.

  39. Sebastian on 6 March, 2008  #link

    Tristan, seems you’re right, I don’t consider developers plain users. As for users navigating the Web in the browser address bar, well, search engines do a great job leading type-in queries to the best matching result, or a SERP. That’s a great and useful layer between humans and URIs, and it should get improved further, beyond autocomplete, ajax’ed suggestions and such neat stuff.

    I strongly believe that users shouldn’t deal with URIs at all; I mean, that’s why we have links with meaningful anchor text. But as long as I can’t click a link on my telly or on a placard, type-in traffic exists, although it lands at quite exotic and unrelated destinations in many cases. The challenge for developers is to keep users away from URIs wherever possible.

  40. Matt N on 20 March, 2008  #link

    Yeah, there is a lot of misunderstanding about URIs in this department. This W3C document helped me a lot:

    http://www.w3.org/TR/chips/#uri

    The thing that stands out to me most (and what most people misunderstand about URIs because of Apache’s default configuration) is that a URI IS NOT a filesystem. Apache uses the filesystem as a convenient default because it’s a ‘quick and dirty’ way to map URIs to resources. Apache provides many other methods for making this mapping: modules such as mod_alias, mod_rewrite and mod_negotiation (as well as many others) offer much more advanced ways of mapping URIs to resources.

    The ‘redirects that the server does automatically’ involving trailing slashes are actually due to an Apache module called mod_dir (its DirectorySlash directive). This module is active in Apache’s default install and is one of the causes of this confusion!
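    [Editor’s sketch: Matt’s “a URI is not a filesystem” point in a few lines of plain Python, not Apache. The route table and file names below are made up for illustration only.]

```python
# A URI space is an abstract mapping, not a directory tree:
# routes are just keys, and nothing forces them to mirror
# anything that exists on disk.
ROUTES = {
    "/": "templates/home.html",
    "/about/": "templates/company.html",
    "/products/widget/": "rendered from a database query",
}

def resolve(uri):
    # Unknown URIs simply map to a 404; no filesystem lookup involved.
    return ROUTES.get(uri, "404 Not Found")
```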

  41. Kenpachi on 26 March, 2008  #link

    Awesome article! I was recently having issues with my site and Google Sitemaps; Googlebot kept encountering 302 redirect errors. After some research I found that Google expects the URLs in your sitemap to end with a trailing slash.

    End of story: I had to recode a few parts and add a slash to the end of every URL, since I make heavy use of URL rewriting. I was about to look for a way to make my site “look cool” by offering an option for URLs without the trailing slash, but your article straightened me out :)

  42. Giuseppe on 24 March, 2009  #link

    It doesn’t matter what’s cool and what’s not; the fashion of the day changes, well, every day.

    What’s important (and you haven’t addressed it) is the fact that people can link to your URL with or without the trailing slash.
    And since Google doesn’t give you full recognition for a link that points to a 301, you want to make sure your site returns 200 for URLs with AND without the slash.

    That is exactly what happened to my site: it was dropped from the Google index because of an automatically added slash (and all of my links don’t have slashes).

    If you make a living with a website, this is something to take into consideration.
    If you just play with it, we can discuss what’s cool and what’s not.

  43. […] Well, to begin with, if you get such a reply from a web developer, start looking for another one, because this answer is profoundly incorrect, to say the least. When it comes to an URL, every single character matters. I like how Sebastian put it in his totally cool post on stealing the trailing slash from the URL: […]

  44. […] is an old post on Sebastian’s Pamphlets that has been picked up by Ann Smarty on Search Engine Journal. The question of using exact URLs is […]

  45. […] end up in the wrong place, or nowhere at all. Let’s use the phone number analogy, as Sebastian did: think of the URL as a phone number. If you dial 3165-2020 you’ll get the pizzeria that […]

  46. Stephen on 10 February, 2010  #link

    Hello Sebastian.

    I never thought I would come across an article so pertinent to my long-held secret: I’ve always had a problem with “trailing” slashes. I’m not an expert, but I always understood that our root index page should be expressed as http://www.example.com/ for instance. That made sense, and I have tried to use it. Over the years, though, I found that many site submission companies would not accept this format. To make a long story short, I began designing various iterations of my site using the following base href: http://www.example.com (i.e. no trailing slash). As a result, I have always written links to internal directories in what I now assume is an incorrect format, like so: /MedwordStore/index.html. I preface internal links with the forward slash since the base reference doesn’t include one. I suppose I could change the base href in each directory to minimize the size of the actual written links under each directory. I also work on our virtual server sometimes and, I may be wrong, but I thought the rule was that those directories are to be finished without a trailing slash?

    Do I have any chance of redemption? Can I change my wrong-thinking and still make bad web designs? :-)

    [301 is your friend]
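    [Editor’s sketch: Stephen’s root-relative links (leading “/”) actually survive his slashless base href, because they resolve against the host alone; it’s directory-relative links where the base’s trailing slash bites. A quick check with Python’s urllib.parse, using a hypothetical /shop/ path:]

```python
from urllib.parse import urljoin

# Root-relative link: the base's trailing slash doesn't matter.
a = urljoin("http://www.example.com", "/MedwordStore/index.html")
b = urljoin("http://www.example.com/", "/MedwordStore/index.html")

# Directory-relative link: with the slash the link stays in /shop/,
# without it "shop" is treated as a file and gets replaced.
c = urljoin("http://www.example.com/shop/", "index.html")
d = urljoin("http://www.example.com/shop", "index.html")
```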

  47. Brian on 11 February, 2010  #link

    Giuseppe said:
    It doesn’t matter what’s cool and what’s not; the fashion of the day changes, well, every day.

    What’s important (and you haven’t addressed it) is the fact that people can link to your URL with or without the trailing slash.
    And since Google doesn’t give you full recognition for a link that points to a 301, you want to make sure your site returns 200 for URLs with AND without the slash.

    That is exactly what happened to my site: it was dropped from the Google index because of an automatically added slash (and all of my links don’t have slashes).
    =======================

    Is this true??

  48. g1smd on 23 March, 2010  #link

    Yes, this is exactly the point. Quoting example.com in a printed newspaper advert, or on radio and TV, allows people to access the site by typing the fewest characters; the site redirects the user to www.example.com/ before showing the content. There are no search engine related effects.

    On the other hand, where another website links out to a non-canonical URL such as example.com or example.com/index.html, this may mean that the linked-to site does not get full credit for that link, when it is the URL www.example.com/ that will serve the content after the redirect.

    The redirect is necessary. It removes the possibility of Duplicate Content issues, but there might still be some drawbacks when non-canonical URLs are used in links.

  49. Sebastian on 18 May, 2010  #link
  50. BlueBoden on 12 December, 2010  #link

    I think you are wrong about the http://example.com/ redirect, because the requested path will always be “/”; it makes no sense to redirect http://example.com to http://example.com/.

    The HTTP specification states that an empty path is the same as “/”, in other words the root directory.

    [Link removed, page doesn’t load in Chrome. For search engines and other services indexing Web resources by URI http://example.com and http://example.com/ are (usually) different URIs, giving two records for a resource. That’s crap. Sebastian]
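    [Editor’s sketch: both commenters are right on their own terms, and Python’s urllib.parse shows the distinction Sebastian is making: equivalent for the HTTP request, yet different as strings, which is what matters to an indexer keying records by URI.]

```python
from urllib.parse import urlsplit

# Per RFC 3986 an empty path is *equivalent* to "/" when the
# request is made, but the two URIs are different strings:
no_slash = urlsplit("http://example.com")
with_slash = urlsplit("http://example.com/")
```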

  51. BlueBoden on 16 January, 2011  #link

    Sebastian, you are indeed wrong: if the “/” slash is missing, it will be automatically assumed. The path can NOT be empty; I’ve read the specifications.

    It will not generate any duplicate content.

  52. […] This article explains the matter in more detail. Note however that it advocates retaining the trailing slash, but it doesn’t focus on pretty URLs in particular. Yes, you should retain the trailing slash for your index page and for any directories, but that’s it. The rest are resources, and so they should not have the slash. The main point is that you must be consistent. From a practical point of view it is not that wrong to retain the slash (although it contradicts the directory structure logic), but have it for every resource, and have all your links conform to that choice. […]

Leave a reply


[If you don't do the math, or the answer is wrong, you'd better have saved your comment before hitting submit. Here is why.]

Be nice and feel free to link out when a link adds value to your comment. More in my comment policy.