Archived posts from the 'Yahoo' Category

Google to change the Robots Exclusion Protocol again

Web crawler directives, partly standardized in the Robots Exclusion Protocol (REP), have evolved since 1994. Nowadays we have to deal with a conglomerate of non-binding de facto standards and microformats, all of them extended by various organizations. All search engines claim that they obey “the standard”, but they refer to their very own REP implementation. In fact, each search engine supports a proprietary set of REP directives, which as a rule differs from the other players’.

Google is the search engine putting the most effort into Robots Exclusion Protocol (REP) evolvements. Their XML Sitemaps, handling submissions instead of crawl restrictions, widened the REP’s scope; the X-Robots-Tag brought us robots meta tags for non-HTML resources like PDF documents, images or video clips; and with Unavailable_after Google made a few clueless news sites happy. With the rel-nofollow microformat on the other hand, or rather its sneaky morphing from a spam fighting tool to its current shape, Google made nobody happy. Yahoo contributed the well-meant but half-assed “robots-nocontent” class name, and of course “noydir” (it’s unlikely that any other engine will support those).

Now Google is working on new robots.txt syntax, and I am, politely put, not amused. Here is why I fear that Google is going to totally mess up the REP:

Google supports a “Noindex:” directive in robots.txt, which is treated as “Disallow:”1). Of course that’s an experiment, but if this behavior doesn’t change we’ll get a beast that is –with regard to the confusion it will produce– way more evil than the rel-nofollow fiasco.

  • A noindex-alias for disallow makes no sense, even when such syntax errors are out there.
  • Mixing crawler directives (allow/disallow) with indexer directives (noindex) is not always a bright idea. It’s bad enough that most Webmasters still believe that “Googlebot ranks their stuff”. (Actually, in some cases it can make sense. For example “nofollow” in robots meta tags (or at least for Google in REL attributes too) is both a crawler instruction as well as an indexer directive.)
  • Noindex and disallow are completely different commands. The REP’s noindex directive means “crawl it, follow its links, but don’t list it on the SERPs”. Disallow forbids crawling, but allows indexing URLs from directory listings or other inbound links.
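
To make the difference concrete, here is a hypothetical robots.txt snippet. The Noindex: directive is Google’s experimental, non-standard syntax (currently treated like Disallow:), and other engines will most probably ignore it:
User-agent: Googlebot
# Standard REP: blocks crawling, but the URL can still show up on SERPs
# as a URL-only listing when third parties link to it.
Disallow: /print/
# Google's experiment: currently treated like a plain Disallow:, although
# a true noindex would mean "crawl it, follow its links, don't list it".
Noindex: /archive/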

Standards should be clear and unambiguous. Google must not redefine syntax and semantics that were in widespread use before Google even existed. I admit they’ve the power to fuck up the REP, but they also have “do no evil”.

Considering that Google is run by a bunch of smart engineers, I hope that they’ll do the right thing eventually. The right thing in this case is giving more power to REP evolvements, before questionable and selfish anti-search initiatives like ACAP ruin both the robots.txt consensus as well as the robots meta tag standard.

My idea of more power to REP evolvements is:

  • Sensible implementation of crawler/indexer-directives adapted from REP tags in robots.txt. Applying page-level instructions ((no)index, (no)follow, noarchive, nosnippet, noodp/noydir, unavailable_after and hopefully nopreview) to groups of URIs is a great way to steer crawling and indexing, especially for sites which for various reasons cannot make use of the HTTP header’s X-Robots-Tag (see the sketch after this list).
  • Implementation of block-level directives in robots.txt. Allowing Webmasters to apply crawler instructions like “noindex” or “nofollow” to particular page areas, like advertising blocks, duplicated text or repetitive navigation elements, addressed via HTML element names and class names and/or DOM-IDs, would be a very flexible instrument to steer crawling and indexing, and it could eliminate many points of failure.
  • Getting Webmasters, Publishers, SEOs and all major engines together to discuss possibly missing granularity and to develop a binding norm obeyed by all players.
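
For comparison, here’s a minimal PHP sketch of the page-level approach that already exists today: sending REP directives for a non-HTML resource via the X-Robots-Tag HTTP header. The file name is a made-up example; robots.txt-level support would cover whole groups of URIs instead:
// Serve a PDF with page-level REP directives in the HTTP header.
// The path is a made-up example.
header("Content-Type: application/pdf");
// X-Robots-Tag carries the same directives as a robots meta tag:
header("X-Robots-Tag: noindex, noarchive, nosnippet", TRUE);
readfile($_SERVER["DOCUMENT_ROOT"] . "/docs/whitepaper.pdf");
exit;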

The last one sounds like wishful thinking. The alternative is that Google (and, if possible, the bigger engines) talk with Webmasters and then launch the necessary REP extensions. The other engines will follow sooner or later. The publishers, although not getting all their desired ACAP restrictions, will be happy too. Standards like the Robots Exclusion Protocol should be developed by engineers.


1) Noindex: is not a plain Disallow:; there’s an interesting difference. In Google’s experiment both directives block crawling, but Disallow: allows URL-indexing based on 3rd party information, and Disallow:'ed URLs can accumulate PageRank from internal as well as external links. Noindex:'ed URLs on the other hand will not appear on SERPs as URL-only listing or with an ODP title and snippet, and I’m quite sure that they will not gather PageRank or other link juice. That means links from any pages to such URLs get an implicit rel-nofollow in Google’s PageRank calculation, just like dangling links. This apparatus could be a great way to handle PageRank leaks (monthly blog archives, printer friendly pages and stuff like that), because shit happens, hence some links to such pages will slip through without condom. I admit that’s a neat idea, but its implementation is flawed because it doesn’t consider the implicit Follow: (that’s syntax Google doesn’t support in robots.txt). A better way to mark site areas which shall not gather PageRank without raping the REP would be a Norank: directive or so. Noindex: without a Nofollow: must not block crawling. Googlebot must fetch those URLs to follow their links.




The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise

We find redirects on every Web site out there. They’re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, but seldom perfect from an SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictory, misleading and plain false advice on redirects. If you for example read “for SEO purposes you must make use of 301 redirects only”, then better close the browser window/tab to protect yourself from crappy advice. A 302 or 307 redirect can be search engine friendly too.

With this post I do plan to bore you to death. So lean back, grab some popcorn, and stay tuned for a longish piece explaining the Interweb’s forwarding requests as dull as dust. Or, if you know everything about redirects, then please digg, sphinn and stumble this post before you surf away. Thanks.

Redirects are defined in the HTTP protocol, not in search engine guidelines

For the moment please forget everything you’ve heard about redirects and their SEO implications, clear your mind, and follow me to the very basics defined in the HTTP protocol. Of course search engines interpret some redirects in a non-standard way, but understanding the norm as well as its use and abuse is necessary to deal with server sided redirects. I don’t bother with outdated HTTP 1.0 stuff, although some search engines still apply it every once in a while; hence I’ll discuss the 307 redirect, introduced in HTTP 1.1, too. For information on client sided redirects please refer to Meta Refresh - the poor man’s 301 redirect or read my other pamphlets on redirects, and stay away from JavaScript URL manipulations.

What is a server sided redirect?

Think about an HTTP redirect as a forwarding request. Although redirects work slightly differently from snail mail forwarding requests, this analogy fits the procedure perfectly. With US Mail forwarding requests, a clerk or postman writes the new address on the envelope before it bounces in front of a no longer valid or temporarily abandoned letter-box or pigeon hole; on the Web, the request’s location (that is, the Web server responding to the server name part of the URL) provides the requestor with the new location (an absolute URL).

A server sided redirect tells the user agent (browser, Web robot, …) that it has to perform another request for the URL given in the HTTP header’s “location” line in order to fetch the requested contents. The type of the redirect (301, 302 or 307) also instructs the user agent how to perform future requests of the Web resource. Because search engine crawlers/indexers try to emulate human traffic with their content requests, it’s important to choose the right redirect type both for humans and robots. That does not mean that a 301-redirect is always the best choice, and it certainly does not mean that you always must return the same HTTP response code to crawlers and browsers. More on that later.

Execution of server sided redirects

Server sided redirects are executed before your server delivers any content. In other words, your server ignores everything it could deliver (be it a static HTML file, a script output, an image or whatever) when it runs into a redirect condition. Some redirects are done by the server itself (see handling incomplete URIs), and there are several places where you can set (conditional) redirect directives: Apache’s httpd.conf, .htaccess, or application layers, for example PHP scripts. (If you suffer from IIS/ASP maladies, this post is for you.) Examples:

  1. The browser requests ww.site.com/page.php?id=1. Apache (httpd.conf) answers with a 301 header pointing to www.site.com/page.php?id=1.
  2. The browser requests site.com/page.php?id=1. The .htaccess file answers with a 301 header pointing to www.site.com/page.php?id=1.
  3. The browser requests www.site.com/page.php?id=1. The script /page.php answers with a 301 header pointing to www.site.com/page.php?id=2.
  4. The browser requests www.site.com/page.php?id=2. The script /page.php answers with a 200 header (info like content length ...) and the content: Article #2.

The 301 header may or may not be followed by a hyperlink pointing to the new location, solely added for user agents which can’t handle redirects. Besides that link, there’s no content sent to the client after the redirect header.

More importantly, you must not send a single byte to the client before the HTTP header. If you for example code [space(s)|tab|new-line|HTML code]<?php ... in a script that shall perform a redirect or is supposed to return a 404 header (or any HTTP header different from the server’s default instructions), you’ll produce a runtime error (“headers already sent” in PHP). The redirection fails, leaving the visitor with an ugly page full of cryptic error messages but no link to the new location.

That means in each and every page or script which possibly has to deal with the HTTP header, put the logic testing those conditions at the very top. Always send the header status code and optional further information like a new location to the client before you process the contents.

After the last redirect header line, terminate execution: with the “L” parameter in .htaccess, with PHP’s exit; statement, or whatever your environment provides.
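
Here’s a minimal PHP sketch of that rule; the condition and target URL are made up for illustration, and headers_sent() is only used as a safety net for debugging:
// Redirect logic belongs at the very top, before any output.
if (headers_sent()) {
    // Something already produced output - a redirect header would fail now.
    error_log("Redirect skipped: output already started");
}
elseif ($_SERVER["REQUEST_URI"] == "/old-page.html") {
    header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    header("Location: http://example.com/new-page.html");
    exit; // nothing may follow the redirect header
}
// ... regular page output starts below this point ...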

What is an HTTP redirect header?

An HTTP redirect, regardless of its type, consists of two essential lines in the HTTP header: the status line and the location field. In this example I’ve requested http://www.sebastians-pamphlets.com/about/, which is an invalid URI because my canonical server name lacks the www-thingy, hence my canonicalization routine outputs this HTTP header:
HTTP/1.1 301 Moved Permanently
Date: Mon, 01 Oct 2007 17:45:55 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
Location: http://sebastians-pamphlets.com/about/
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

The redirect response code in an HTTP status line

The first line of the header defines the protocol version, the response code, and provides a human readable reason phrase. Here is a shortened and slightly modified excerpt quoted from the HTTP/1.1 protocol definition:

Status-Line

The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP (space) characters. No CR or LF is allowed except in the final CRLF sequence.

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
[e.g. “HTTP/1.1 301 Moved Permanently” + CRLF]

Status Code and Reason Phrase

The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. […] The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.

The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. […]:
[…]
- 3xx: Redirection - Further action must be taken in order to complete the request
[…]

The individual values of the numeric status codes defined for HTTP/1.1, and an example set of corresponding Reason-Phrases, are presented below. The reason phrases listed here are only recommendations — they MAY be replaced by local equivalents without affecting the protocol [that means you could translate and/or rephrase them].
[…]
300: Multiple Choices
301: Moved Permanently
302: Found [Elsewhere]
303: See Other
304: Not Modified
305: Use Proxy

307: Temporary Redirect
[…]

In terms of SEO the understanding of 301/302-redirects is important. 307-redirects, introduced with HTTP/1.1, are still capable of confusing some search engines, even major players like Google when Ms. Googlebot for some reason thinks she must do HTTP/1.0 requests, usually caused by weird or ancient server configurations (or possibly testing newly discovered sites under certain circumstances). You should not perform 307 redirects in response to most HTTP/1.0 requests; use 302/301 –whatever fits best– instead. More info on this issue below in the 302/307 sections.

Please note that the default response code of all redirects is 302. That means when you send an HTTP header with a location directive but without an explicit response code, your server will return a 302-Found status line. That’s kinda crappy, because in most cases you want to avoid the 302 code like the plague. Do no nay never rely on default response codes! Always prepare a server sided redirect with a status line stating an actual response code (301, 302 or 307)! In server sided scripts (PHP, Perl, ColdFusion, JSP/Java, ASP/VB-Script…) always send a complete status line, and in .htaccess or httpd.conf add an [R=301|302|307,L] parameter to statements like RewriteRule:
RewriteRule (.*) http://www.site.com/$1 [R=301,L]

The redirect header’s “location” field

The next element you need in every redirect header is the location directive. Here is the official syntax:

Location

The Location response-header field is used to redirect the recipient to a location other than the Request-URI for completion of the request or identification of a new resource. […] For 3xx responses, the location SHOULD indicate the server’s preferred URI for automatic redirection to the resource. The field value consists of a single absolute URI.

Location = “Location” “:” absoluteURI [+ CRLF]

An example is:

Location: http://sebastians-pamphlets.com/about/

Please note that the value of the location field must be an absolute URL, that is a fully qualified URL with scheme (http|https), server name (domain|subdomain), and path (directory/file name) plus the optional query string ("?" followed by variable/value pairs like ?id=1&page=2...), no longer than 2047 bytes (better 255 bytes, because most scripts out there don’t process longer URLs for historical reasons). A relative URL like ../page.php might work in (X)HTML (although you’d better plan a spectacular suicide than any use of relative URIs!), but you must not use relative URLs in HTTP response headers!
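
If your application only knows the new path, a small helper can qualify it before it goes into the location field. This is just a sketch: it assumes plain HTTP and takes the host name from the request, where a hard-coded canonical server name would be safer:
// Turn a local path like /cms/contact.php into a fully qualified URL
// for the Location field. Assumes plain HTTP; adjust for HTTPS setups.
function absoluteUrl($path) {
    $host = $_SERVER["HTTP_HOST"]; // better: your canonical server name
    return "http://" . $host . $path;
}
header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
header("Location: " . absoluteUrl("/cms/contact.php"));
exit;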

How to implement a server sided redirect?

You can perform HTTP redirects with statements in your Web server’s configuration, and in server sided scripts, e.g. PHP or Perl. JavaScript is a client sided language and therefore lacks a mechanism to do HTTP redirects. That means all JS redirects count as a 302-Found response.

Bear in mind that when you redirect, you possibly leave tracks of outdated structures in your HTML code, not to speak of incoming links. You must change each and every internal link to the new location, as well as all external links you control or where you can ask for an URL update. If you leave any outdated links, visitors probably won’t spot it (although every redirect slows things down), but search engine spiders continue to follow them, which eventually ends in redirect chains. Chained redirects are often the cause of deindexed pages, site areas or even complete sites, hence do no more than one redirect in a row and consider two redirects in a row risky. You don’t control offsite redirects; in some cases a search engine has already counted one or two redirects before it requests your redirecting URL (caused by redirecting traffic counters etcetera). Always redirect to the final destination to avoid useless hops which kill your search engine traffic. (Google recommends “that you use fewer than five redirects for each request”, but don’t try to max out such limits because other services might be less BS-tolerant.)

Like conventional forwarding requests, redirects aren’t trusted forever: even a permanent 301-redirect’s source URL will be requested by search engines every now and then, because they can’t trust you. As long as there is one single link pointing to an outdated and redirecting URL out there, it’s not forgotten. It will stay alive in search engine indexes and address books of crawling engines even when the last link pointing to it was changed or removed. You can’t control that, and you can’t find all inbound links a search engine knows, despite their better reporting nowadays (neither Yahoo’s site explorer nor Google’s link stats show you all links!). That means you must maintain your redirects forever, and you must not remove (permanent) redirects. Maintenance of redirects includes hosting abandoned domains, and updates of location directives whenever you change the final structure. With each and every revamp that comes with URL changes, check for incoming redirects and make sure that you eliminate unnecessary hops.

Often you’ve many choices where and how to implement a particular redirect. You can do it in scripts and even static HTML files, CMS software, or in the server configuration. There’s no such thing as a general best practice, just a few hints to bear in mind.

  • Doubt: Don’t believe Web designers and developers when they say that a particular task can’t be done without redirects. Do your own research, or ask an SEO expert. When you for example plan to make a static site dynamic by pulling the contents from a database with PHP scripts, you don’t need to change your file extensions from *.html to *.php. Apache can parse .html files for PHP, just enable that in your root’s .htaccess:
    AddType application/x-httpd-php .html .htm .shtml .txt .rss .xml .css

    Then generate tiny PHP scripts calling the CMS to replace the outdated .html files. That’s not perfect but way better than URL changes, provided your developers can manage the outdated links in the CMS’ navigation. Another pretty popular abuse of redirects is click tracking. You don’t need a redirect script to count clicks in your database, make use of the onclick event instead.
  • Transparency: When the shit hits the fan and you need to track down a redirect with not more than the HTTP header’s information in your hands, you’ll begin to believe that performance and elegant coding is not everything. Reading and understanding a large httpd.conf file, several complex .htaccess files, and searching redirect routines in a conglomerate of a couple generations of scripts and include files is not exactly fun. You could add a custom field identifying the piece of redirecting code to the HTTP header. In .htaccess that would be achieved with
    Header add X-Redirect-Src "/content/img/.htaccess"

    and in PHP with
    header("X-Redirect-Src: /scripts/inc/header.php", TRUE);

    (Whether or not you should encode or at least obfuscate code locations in headers depends on your security requirements.)
  • Encapsulation: When you must implement redirects in more than one script or include file, then encapsulate all redirects including all the logic (redirect conditions, determining new locations, …). You can do that in an include file with a meaningful file name for example. Also, instead of plastering the root’s .htaccess file with tons of directory/file specific redirect statements, you can gather all requests for redirect candidates and call a script which tests the REQUEST_URI to execute the suitable redirect. In .htaccess put something like:
    RewriteEngine On
    RewriteBase /old-stuff
    RewriteRule ^(.*)\.html$ do-redirects.php

    This code calls /old-stuff/do-redirects.php for each request of an .html file in /old-stuff/. The PHP script:
    $location = false;
    $requestUri = $_SERVER["REQUEST_URI"];
    // Map old URIs to their new locations:
    if (stristr($requestUri, "/contact.html")) {
        $location = "http://example.com/new-stuff/contact.htm";
    }
    ...
    if ($location) {
        @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
        @header("X-Redirect-Src: /old-stuff/do-redirects.php", TRUE);
        @header("Location: $location");
        exit;
    }
    else {
        [output the requested file or whatever]
    }

    (This is also an example of a redirect include file which you could insert at the top of a header.php include or so. In fact, you can include this script in some files and call it from .htaccess without modifications.) This method will not work with ASP on IIS because amateurish wannabe Web servers don’t provide the REQUEST_URI variable.
  • Documentation: When you design or update an information architecture, your documentation should contain a redirect chapter. Also comment all redirects in the source code (your ingenious regular expressions might lack readability when someone else looks at your code). It’s a good idea to have a documentation file explaining all redirects on the Web server (you might work with other developers when you change your site’s underlying technology in a few years).
  • Maintenance: Debugging legacy code is a nightmare. And yes, what you write today becomes legacy code in a few years. Thus keep it simple and stupid, implement redirects transparently rather than elegantly, and don’t forget that you must change your ancient redirects when you revamp a site area which is the target of redirects.
  • Performance: Even when performance is an issue, you can’t do everything in httpd.conf. When you for example move a large site changing the URL structure, the redirect logic becomes too complex in most cases. You can’t do database lookups and stuff like that in server configuration files. However, some redirects, for example server name canonicalization, should be performed there, because they’re simple and not likely to change. If you can’t change httpd.conf, .htaccess files are for you. They’re slower than cached config files but still faster than application scripts.

Redirects in server configuration files

Here is an example of a canonicalization redirect in the root’s .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^sebastians-pamphlets\.com [NC]
RewriteRule (.*) http://sebastians-pamphlets.com/$1 [R=301,L]

  1. The first line enables Apache’s mod_rewrite module. Make sure it’s available on your box before you copy, paste and modify the code above.
  2. The second line checks the server name in the HTTP request header (received from a browser, robot, …). The “NC” parameter ensures that the test of the server name (which is, like the scheme part of the URI, not case sensitive by definition) is done as intended. Without this parameter a request of http://SEBASTIANS-PAMPHLETS.COM/ would run into an unnecessary redirect. The rewrite condition returns TRUE when the server name is not sebastians-pamphlets.com. There’s an important detail: the negation “!”.

    Most Webmasters do it the other way round. They check whether the server name equals an unwanted server name, for example with RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]. That’s not exactly efficient, and it’s fault-prone. It’s not efficient because one needs to add a rewrite condition for each and every server name a user could type in and the Web server would respond to. On most machines that’s a huge list like “w.example.com, ww.example.com, w-w-w.example.com, …” because the default server configuration catches all subdomains that aren’t explicitly defined.

    Of course next to nobody puts that many rewrite conditions into the .htaccess file, hence this method is fault-prone and not suitable to fix canonicalization issues. In combination with thoughtless usage of relative links (bullcrap that most designers and developers love out of laziness and lack of creativity or at least fantasy), one single link to an existing page on a non-existing subdomain not redirected in such an .htaccess file could result in search engines crawling and possibly even indexing a complete site under the unwanted server name. When a savvy competitor spots this exploit you can say good bye to a fair amount of your search engine traffic.

    Another advantage of my single line of code is that you can point all domains you’ve registered to catch type-in traffic or whatever to the same Web space. Every new domain runs into the canonicalization redirect, 100% error-free.

  3. The third line performs the 301 redirect to the requested URI using the canonical server name. That means when the request URI was http://www.sebastians-pamphlets.com/about/, the user agent gets redirected to http://sebastians-pamphlets.com/about/. The “R” parameter sets the response code, and the “L” parameter means last (=exit) if the condition matches; that is, the statements following the redirect execution, like other rewrite rules and such stuff, will not be parsed.

If you’ve access to your server’s httpd.conf file (which most hosting services don’t allow), then better do such redirects there. The reason for this recommendation is that Apache must look for .htaccess directives in the current directory and all its upper levels for each and every requested file. If the request is for a page with lots of embedded images or other objects, that sums up to hundreds of hard disk accesses slowing down the page loading time. The server configuration on the other hand is cached and therefore way faster. Learn more about .htaccess disadvantages. However, since most Webmasters can’t modify their server configuration, I provide .htaccess examples only. If you can, then you know how to put it in httpd.conf. ;)
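
If neither httpd.conf nor .htaccess is an option, or you simply prefer application-level logic, the same canonicalization can be sketched at the top of a PHP script; treat this as an illustration, not a drop-in replacement for the mod_rewrite solution above:
// Server name canonicalization at the application level (sketch).
// Redirect any non-canonical host to sebastians-pamphlets.com,
// preserving the requested path and query string.
$canonicalHost = "sebastians-pamphlets.com";
if (strtolower($_SERVER["HTTP_HOST"]) != $canonicalHost) {
    header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    header("Location: http://" . $canonicalHost . $_SERVER["REQUEST_URI"]);
    exit;
}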

Redirecting directories and files with .htaccess

When you need to redirect chunks of static pages to another location, the easiest way to do that is Apache’s Redirect directive. The basic syntax is Redirect [301|302|307] Path URL, e.g. Redirect 307 /blog/feed http://feedburner.com/myfeed or Redirect 301 /contact.htm /blog/contact/. Path is always a file system path relative to the Web space’s root. URL is either a fully qualified URL (on another machine) like http://feedburner.com/myfeed, or a relative URL on the same server like /blog/contact/ (Apache adds scheme and server in this case, so that the HTTP header is built with an absolute URL in the location field; however, omitting the scheme+server part of the target URL is not recommended, see the warning below).

When you for example want to consolidate a blog on its own subdomain with a corporate Web site at example.com, then put
Redirect 301 / http://example.com/blog

in the .htaccess file of blog.example.com. When you then request http://blog.example.com/category/post.html you’re redirected to http://example.com/blog/category/post.html.

Say you’ve moved your product pages from /products/*.htm to /shop/products/*.htm; then put
Redirect 301 /products http://example.com/shop/products

Omit the trailing slashes when you redirect directories. To redirect particular files on the other hand you must fully qualify the locations:
Redirect 302 /misc/contact.html http://example.com/cms/contact.php

or, when the new location resides on the same server:
Redirect 301 /misc/contact.html /cms/contact.php

Warning: Although Apache allows local redirects like Redirect 301 /misc/contact.html /cms/contact.php, with some server configurations this will result in 500 server errors on all requests. Therefore I recommend the use of fully qualified URLs as redirect target, e.g. Redirect 301 /misc/contact.html http://example.com/cms/contact.php!

Maybe you found a reliable and unbeatably cheap hosting service to host your images. Copy all image files from example.com to image-example.com and keep the directory structures as well as all file names. Then add to example.com’s .htaccess
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://www.image-example.com$1.$2

The regex should match e.g. /img/nav/arrow-left.png so that the user agent is forced to request http://www.image-example.com/img/nav/arrow-left.png. If you’ve converted your GIFs and JPGs to the PNG format during this move, simply change the redirect statement to
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://www.image-example.com$1.png

With regular expressions and RedirectMatch you can perform very creative redirects.

Please note that the response codes used in the code examples above most probably do not fit the type of redirect you’d do in real life with similar scenarios. I’ll discuss use cases for all redirect response codes (301|302|307) later on.

Redirects in server sided scripts

You can do HTTP redirects only with server sided programming languages like PHP, ASP, Perl etcetera. Scripts in those languages generate the output before anything is sent to the user agent. It should be a no-brainer, but these PHP examples don’t count as server sided redirects:
print "<META HTTP-EQUIV=Refresh CONTENT="0; URL=http://example.com/">\n";
print "<script type="text/javascript">window.location = "http://example.com/";</script>\n";

Just because you can output a redirect with a server sided language does not make the redirect an HTTP redirect. ;)

In PHP you perform HTTP redirects with the header() function:
$newLocation = "http://example.com/";
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: $newLocation");
exit;

The first input parameter of header() is the complete header line; in the first line of code above that’s the status-line. The second parameter tells whether a previously sent header line shall be replaced (default behavior) or not. The third parameter sets the HTTP status code; don’t use it more than once. If you use an ancient PHP version (prior to 4.3.0) you can’t pass the 2nd and 3rd input parameters. The “@” suppresses PHP warnings and error messages.

With ColdFusion you code
<CFHEADER statuscode="307" statustext="Temporary Redirect">
<CFHEADER name="Location" value="http://example.com/">

A redirecting Perl script begins with
#!/usr/bin/perl -w
use strict;
print "Status: 302 Found Elsewhere\r\n", "Location: http://example.com/\r\n\r\n";
exit;

Even with ASP you can do server sided redirects. VBScript:
Dim newLocation
newLocation = "http://example.com/"
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", newLocation
Response.End

JScript:
function RedirectPermanent(newLocation) {
Response.Clear();
Response.Status = 301;
Response.AddHeader("Location", newLocation);
Response.Flush();
Response.End();
}
...
Response.Buffer = true;
...
RedirectPermanent ("http://example.com/");

Again, if you suffer from IIS/ASP maladies: here you go.

Remember: Don’t output anything before the redirect header, and nothing after the redirect header!

Redirects done by the Web server itself

When you read your raw server logs, you’ll find a few 302 and/or 301 redirects Apache has performed without an explicit redirect statement in the server configuration, .htaccess, or a script. Most of these automatic redirects are the result of a very popular bullshit practice: removing trailing slashes. Although the standard defines that an URI like /directory is not a file name by default, and therefore equals /directory/ if there’s no file named /directory, choosing the version without the trailing slash is lazy at least, and it creates lots of troubles (404s in some cases, otherwise external redirects, and always duplicate content issues you should fix with URL canonicalization routines).

For example Yahoo is a big fan of truncated URLs. They might save a few terabytes in their indexes by storing URLs without the trailing slash, but they send every user’s browser twice to those locations. Web servers must do a 302 or 301 redirect for each Yahoo referrer requesting a directory or pseudo-directory, because they can’t serve the default document of an omitted path segment (the path component of an URI begins with a slash, the slash is its segment delimiter, and a trailing slash stands for the last (or only) segment representing a default document like index.html). From the Web server’s perspective /directory does not equal /directory/; only /directory/ addresses /directory/index.(htm|html|shtml|php|...), whereby the file name of the default document must be omitted (among other things to preserve the URL structure when the underlying technology changes). Also, the requested URI without its trailing slash may address a file or an on-the-fly output (if you make use of mod_rewrite to mask ugly URLs you better test what happens with screwed URIs of yours).

Yahoo wastes even their own resources. Their crawler persistently requests the shortened URL, which bounces with a redirect to the canonical URL. Here is an example from my raw logs:
74.6.20.165 - - [05/Oct/2007:01:13:04 -0400] "GET /directory HTTP/1.0" 301 26 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.20.165 - - [05/Oct/2007:01:13:06 -0400] "GET /directory/ HTTP/1.0" 200 8642 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
[I’ve replaced a rather long path with “directory”]

If you persistently redirect Yahoo to the canonical URLs (with trailing slash), they’ll use your canonical URLs on the SERPs eventually (but their crawler still requests Yahoo-generated crap). Having many good inbound links as well as clean internal links –all with the trailing slash– helps too, but is not a guarantee for canonical URL normalization at Yahoo.

Here is an example. This URL responds with 200-OK, regardless whether it’s requested with or without the canonical trailing slash:
http://www.jlh-design.com/2007/06/im-confused/
(That’s the default (mis)behavior of everybody’s darling with permalinks, by the way. Here is some PHP canonicalization code to fix this flaw.) All internal links use the canonical URL. I didn’t find a serious inbound link pointing to a truncated version of this URL. Yahoo’s Site Explorer lists the URL without the trailing slash: […]/im-confused, and the same happens on Yahoo’s SERPs: […]/im-confused. Even when a server responds 200-OK to two different URLs, a serious search engine should normalize according to the internal links as well as the entry in the XML sitemap, and therefore choose the URL with the trailing slash as canonical URL.
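
The linked canonicalization code isn’t reproduced here, but the idea can be sketched in a few lines of PHP. This is simplified on purpose: it skips URIs that look like file names or carry a query string, example.com stands in for the canonical host, and a real implementation (especially for WordPress) needs more checks:
// Sketch: 301 permalink-style URIs without a trailing slash to the
// slashed version. Deliberately naive - see the caveats above.
$uri = $_SERVER["REQUEST_URI"];
if (substr($uri, -1) != "/" && strpos($uri, ".") === FALSE && strpos($uri, "?") === FALSE) {
    header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    header("Location: http://example.com" . $uri . "/");
    exit;
}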

Fucking up links on search result pages is evil enough, although fortunately this crap doesn’t influence discovery crawling directly because those aren’t crawled by other search engines (but scraped or syndicated search results are crawlable). Actually, that’s not the whole horror story. Other Yahoo properties remove the trailing slashes from directory and home page links too (look at the “What Readers Viewed” column in your MBL stats for example), and some of those services provide crawlable pages carrying invalid links (pulled from the search index or screwed otherwise). That means other search engines pick those incomplete URLs from Yahoo’s pages (or other pages with links copied from Yahoo pages), crawl them, and end up with search indexes blown up with duplicate content. Maybe Yahoo does all that only to burn Google’s resources by keeping their canonicalization routines and duplicate content filters busy, but it’s not exactly gentlemanlike that such cat fights affect all Webmasters across the globe. Yahoo directly as well as indirectly burns our resources with unnecessary requests of screwed URLs, and we must implement sanitizing redirects for software like WordPress –which doesn’t care enough about URL canonicalization–, just because Yahoo manipulates our URLs to peeve Google. Doh!

If somebody from Yahoo (or MSN, or any other site manipulating URLs this way) reads my rant, I highly recommend this quote from Tim Berners-Lee (January 2005):

Scheme-Based Normalization
[…] the following […] URIs are equivalent:
http://example.com
http://example.com/
In general, an URI that uses the generic syntax for authority with an empty path should be normalized to a path of “/”.
[…]
Normalization should not remove delimiters [”/” or “?”] when their associated component is empty unless licensed to do so by the scheme specification. [emphasis mine]

In my book, sentences like “Note that the absolute path cannot be empty; if none is present in the original URI, it MUST be given as ‘/’ […]” in the HTTP specification as well as Section 3.3 of the URI’s Path Segment specs do not sound like a licence to screw URLs. Omitting the path segment delimiter “/” representing an empty last path segment might sound legal if the specs are interpreted without applying common sense, but knowing that Web servers can’t respond to requests of those incomplete URIs and nevertheless truncating trailing slashes is a brain dead approach (actually, such crap deserves a couple unprintable adjectives).

Frequently scanning the raw logs for 302/301 redirects is a good idea. Also, implement documented canonicalization redirects when a piece of software responds to different versions of URLs. It’s the Webmaster’s responsibility to ensure that each piece of content is available under one and only one URL. You cannot rely on any search engine’s URL canonicalization, because shit happens, even with highly sophisticated algos:

When search engines crawl identical content through varied URLs, there may be several negative effects:

1. Having multiple URLs can dilute link popularity. For example, in the diagram above [example in Google’s blog post], rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs.

2. Search results may display user-unfriendly URLs […]

Redirect or not? A few use cases.

Before I blather about the three redirect response codes you can choose from, I’d like to talk about a few situations where you shall not redirect, and cases where you probably don’t redirect but should do so.

Unfortunately, it’s a common practice to replace various sorts of clean links with redirects. Whilst legions of Webmasters don’t obfuscate their affiliate links, they hide their valuable outgoing links in fear of PageRank leaks and other myths, or react to search engine FUD with castrated links.

With very few exceptions, the A Element a.k.a. Hyperlink is the best method to transport link juice (PageRank, topical relevancy, trust, reputation …) as well as human traffic. Don’t abuse my beloved A Element:
<a onclick="window.location = 'http://example.com/'; return false;" title="http://example.com">bad example</a>

Such a “link” will transport some visitors, but does not work when JavaScript is disabled or the user agent is a Web robot. This “link” is not an iota better:
<a href="http://example.com/blocked-directory/redirect.php?url=http://another-example.com/" title="Another bad example">example</a>

Simplicity pays. You don’t need the complexity of HREF values changed to ugly URLs of redirect scripts with parameters, located in an uncrawlable path, just because you don’t want search engines to count the links. Not to speak of cases where redirecting links is unfair or even risky, for example click tracking scripts which do a redirect.

  • If you need to track outgoing traffic, then by all means do it in a search engine friendly way with clean URLs which benefit the link destination and don’t do you any harm; here is a proven method.
  • If you really can’t vouch for a link, for example because you link out to a so called bad neighborhood (whatever that means), or to a link broker, or to someone who paid for the link and Google can detect it or a competitor can turn you in, then add rel="nofollow" to the link. Yeah, rel-nofollow is crap … but it’s there, it works, we won’t get something better, and it’s less complex than redirects, so just apply it to your fishy links as well as to unmoderated user input.
  • If you decide that an outgoing link adds value for your visitors, and you personally think that the linked page is a great resource, then almost certainly search engines will endorse the link (regardless whether it shows a toolbar PR or not). There’s way too much FUD and crappy advice out there.
  • You really don’t lose PageRank when you link out. Honestly gained PageRank sticks to your pages. You only lower the amount of PageRank you can pass to your internal links a little. That’s not a bad thing, because linking out to great stuff can bring in more PageRank in the form of natural inbound links (there are other advantages too). Also, Google dislikes PageRank hoarding and the unnatural link patterns you create with practices like that.
  • Every redirect slows things down, and chances are that a user agent messes with the redirect, which can result in rendering nothing, scrambled stuff, or something completely unrelated. I admit that’s not a very common problem, but it happens with some outdated though still used browsers. Avoid redirects where you can.

In some cases you should perform redirects for sheer search engine compliance, in other words for selfish SEO purposes. For example, don’t let search engines handle your affiliate links.

  • If you operate an affiliate program, then internally redirect all incoming affiliate links to consolidate your landing page URLs. Although incoming affiliate links don’t bring much link juice, every little helps when it lands on a page which doesn’t credit search engine traffic to an affiliate.
  • Search engines are pretty smart when it comes to identifying affiliate links. (Thin) affiliate sites suffer from decreasing search engine traffic. Fortunately, the engines respect robots.txt, which means they usually don’t follow links via blocked subdirectories. When you link to your merchants within the content, using URLs that don’t smell like affiliate links, it’s harder to detect the intention of those links algorithmically. Of course that doesn’t protect you from smart algos trained to spot other patterns, and this method will not pass reviews by humans, but it’s worth a try.
  • If you’ve pages which change their contents often by featuring for example a product of the day, you might have a redirect candidate. Instead of duplicating a daily changing product page, you can do a dynamic soft redirect to the product pages. Whether a 302 or a 307 redirect is the best choice depends on the individual circumstances. However, you can promote the hell out of the redirecting page, so that it gains all the search engine love without passing on PageRank etc. to product pages which phase out after a while. (If the product page is hosted by the merchant you must use a 307 response code. Otherwise make sure the 302'ing URL is listed in your XML sitemap with a high priority. If you can, send a 302 in response to most HTTP/1.0 requests, and a 307 in response to HTTP/1.1 requests. See the 302/307 sections for more information.)
  • If an URL comes with a session-ID or another tracking variable in its query string, you must 301-redirect search engine crawlers to an URI without such randomly generated noise (see the sketch after this list). There’s no need to redirect a human visitor, but search engines hate tracking variables, so just don’t let them fetch such URLs.
  • There are other use cases involving creative redirects which I’m not willing to discuss here.
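
Here’s a rough PHP sketch of that session-ID case; the query variable name sid and the user agent check are examples only, and serious crawler detection should verify IP addresses as well:
// Sketch: strip a session-ID from the query string for crawlers and
// 301 them to the clean URL. "sid" and the crawler list are made up.
$userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
$isCrawler = preg_match("/(Googlebot|Slurp|msnbot)/i", $userAgent);
if ($isCrawler && isset($_GET["sid"])) {
    $params = $_GET;
    unset($params["sid"]);
    $query = http_build_query($params);
    $cleanUri = $_SERVER["SCRIPT_NAME"] . ($query ? "?" . $query : "");
    header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
    header("Location: http://example.com" . $cleanUri);
    exit;
}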

Of course both lists above aren’t complete.

Choosing the best redirect response code (301, 302, or 307)

I’m sick of articles like “search engine friendly 301 redirects” propagating that only permanent redirects work with search engines. That’s a lie. I read those misleading headlines daily on the webmaster boards, in my feed reader, at Sphinn, and elsewhere … and I’m not amused. Lemmings. Amateurish copycats. Clueless plagiarists. [Insert a few lines of somewhat offensive language and swearing ;) ]

Of course most redirects out there return the wrong response code. That’s because the default HTTP response code for all redirects is 302, and many code monkeys forget to send a status-line providing the 301 Moved Permanently when an URL was actually moved or the requested URI is not the canonical URL. When a clueless coder or hosting service invokes a Location: http://example.com/ header statement without a previous HTTP/1.1 301 Moved Permanently status-line, the redirect becomes a soft 302 Found. That does not mean that 302 or 307 redirects aren’t search engine friendly at all. All HTTP redirects can be safely used with regard to search engines. The point is that one must choose the correct response code based on the actual circumstances and goals. Blindly 301'ing everything is counterproductive sometimes.

301 - Moved Permanently

The message of a 301 response code to the requestor is: “The requested URI has vanished. It’s gone forever and perhaps it never existed. I will never supply any contents under this URI (again). Request the URL given in location, and replace the outdated or plain wrong URL in your bookmarks/records by the new one for future requests. Don’t bother me again. Farewell.”

Let’s start with the definition of a 301 redirect quoted from the HTTP/1.1 specifications:

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs [(1)]. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.

The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s). […]

Read a polite “SHOULD” as “must”.

(1) Although technically you could provide more than one location, you must not do that because it irritates too many user agents, search engine crawlers included.

Make use of the 301 redirect when a requested Web resource was moved to another location, or when a user agent requests an URI which is definitely wrong and you’re able to tell the correct URI with no doubt. For URL canonicalization purposes (more info here) the 301 redirect is your one and only friend.

You must not recycle any 301'ing URLs; that means once an URL responds with 301 you must stick with it, and you can’t reuse this URL for other purposes next year or so.

Also, you must maintain the 301 response and a location corresponding to the redirecting URL forever. That does not mean that the location can’t be changed. Say you’ve moved a contact page /contact.html to a CMS where it resides under /cms/contact.php. If a user agent requests /contact.html, it gets a 301 redirect pointing to /cms/contact.php. Two years later you change your software again, and the contact page moves to /blog/contact/. In this case you must change the initial redirect, and create a new one:
/contact.html 301-redirects to /blog/contact/, and
/cms/contact.php 301-redirects to /blog/contact/.
If you keep the initial redirect /contact.html to /cms/contact.php, and redirect /cms/contact.php to /blog/contact/, you create a redirect chain which can deindex your content at search engines. Well, two redirects before a crawler reaches the final URL shouldn’t be a big deal, but add a canonicalization redirect fixing a www vs. non-www issue to the chain, imagine a crawler comes from a directory or links list which counts clicks with a redirect script, and you’ve four redirects in a row. That’s too much; most probably all search engines will not index such an unreliable Web resource.

301 redirects transfer search engine love like PageRank gathered by the redirecting URL to the new location, but the search engines keep the old URL in their indexes, and revisit it every now and then to check whether the 301 redirect is stable or not. If the redirect is gone on the next crawl, the new URL loses the reputation earned from the redirect’s inbound links. It’s impossible to get all inbound links changed, hence don’t delete redirects after a move.

It’s a good idea to check your 404 logs weekly or so, because search engine crawlers pick up malformed links from URL drops and such. Even when the link is invalid, for example because a crappy forum software has shortened the URL, it’s an asset you should not waste with a 404 or even 410 response. Find the best matching existing URL and do a 301 redirect.
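
One way to do that is a custom 404 handler that looks for the best matching existing URL and 301s to it. Here’s a rough PHP sketch; the URL list, the matching rule and the 404 wiring (e.g. Apache’s ErrorDocument 404 /404.php) are simplified examples:
// Rough sketch of a 404 handler that rescues truncated inbound links.
$requestUri = $_SERVER["REQUEST_URI"];
// Hypothetical list of known-good URLs to match against:
$knownUrls = array("/blog/contact/", "/blog/about/", "/blog/archives/");
foreach ($knownUrls as $url) {
    // Crude "best match": the broken URI is a prefix of a real one.
    if ($requestUri != "/" && strpos($url, rtrim($requestUri, "/")) === 0) {
        header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
        header("Location: http://example.com" . $url);
        exit;
    }
}
header("HTTP/1.1 404 Not Found", TRUE, 404);
// ... output the regular 404 page ...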

Here is what Google says about 301 redirects:

[Source] 301 (Moved permanently) […] You should use this code to let Googlebot know that a page or site has permanently moved to a new location. […]

[Source …] If you’ve restructured your site, use 301 redirects (”RedirectPermanent”) in your .htaccess file to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.) […]

[Source …] If your old URLs redirect to your new site using HTTP 301 (permanent) redirects, our crawler will discover the new URLs. […] Google listings are based in part on our ability to find you from links on other sites. To preserve your rank, you’ll want to tell others who link to you of your change of address. […]

[Source …] If your site [or page] is appearing as two different listings in our search results, we suggest consolidating these listings so we can more accurately determine your site’s [page’s] PageRank. The easiest way to do so [on site level] is to set the preferred domain using our webmaster tools. You can also redirect one version [page] to the other [canonical URL] using a 301 redirect. This should resolve the situation after our crawler discovers the change. […]

That’s exactly what the HTTP standard wants a search engine to do. Yahoo handles 301 redirects a little differently:

[Source …] When one web page redirects to another web page, Yahoo! Web Search sometimes indexes the page content under the URL of the entry or “source” page, and sometimes index it under the URL of the final, destination, or “target” page. […]

When a page in one domain redirects to a page in another domain, Yahoo! records the “target” URL. […]

When a top-level page [http://example.com/] in a domain presents a permanent redirect to a page deep within the same domain, Yahoo! indexes the “source” URL. […]

When a page deep within a domain presents a permanent redirect to a page deep within the same domain, Yahoo! indexes the “target” URL. […]

Because of mapping algorithms directing content extraction, Yahoo! Web Search is not always able to discard URLs that have been seen as 301s, so web servers might still see crawler traffic to the pages that have been permanently redirected. […]

As for the non-standard procedure to handle redirecting root index pages, that’s not a big deal, because in most cases a site owner promotes the top level page anyway. Actually, that’s a smart way to “break the rules” for the better. The way-too-many requests of permanently redirecting pages are more annoying.

Moving sites with 301 redirects

When you restructure a site, consolidate sites or separate sections, move to another domain, flee from a free host, or do other structural changes, then in theory you can install page-by-page 301 redirects and you’re done. Actually, that works, but it comes with disadvantages like a total loss of all search engine traffic for a while. The larger the site, the longer the while. With a large site highly dependent on SERP referrers this procedure can be the first phase of a filing-for-bankruptcy plan, because search engines don’t send (much) traffic during the move.

Let’s look at the process from a search engine’s perspective. The crawling of old.com all of a sudden bounces at 301 redirects to new.com. None of the redirect targets is known to the search engine. The crawlers report back the redirect responses and the new URLs as well. The indexers spotting the redirects block the redirecting URLs for the query engine, but can’t pass the properties (PageRank, contextual signals and so on) of the redirecting resources to the new URLs, because those aren’t crawled yet.

The crawl scheduler initiates the handshake with the newly discovered server to estimate its robustness, and most probably makes a conservative guess of the crawl frequency this server can sustain. The queue of uncrawled URLs belonging to the new server grows way faster than the crawlers actually deliver the first contents fetched from the new server.

Each and every URL fetched from the old server vanishes from the SERPs in no time, whilst the new URLs aren’t crawled yet, or are still waiting for an idle indexer able to assign them the properties of the old URLs, doing heuristic checks on the stored contents from both URLs and whatnot.

Slowly, sometimes weeks after the beginning of the move, the first URLs from the new server populate the SERPs. They don’t rank very well, because the search engine has not yet discovered the new site’s structure and linkage completely, so that a couple of ranking factors stay temporarily unconsidered. Some of the new URLs may appear as URL-only listings, solely indexed based on off-page factors, hence lacking the ability to trigger search query relevance for their contents.

Many of the new URLs can’t regain their former PageRank in the first reindexing cycle, because without a complete survey of the “new” site’s linkage there’s only the PageRank from external inbound links passed by the redirects available (internal links no longer count for PageRank when the search engine discovers that the source of internally distributed PageRank does a redirect), so that they land in a secondary index.

Next, the suddenly lower PageRank results in a lower crawling frequency for the URLs in question. Also, the process removing redirecting URLs still runs way faster than the reindexing of moved contents from the new server. The more URLs are involved in a move, the longer the reindexing and reranking lasts. Replace Google’s very own PageRank with any term and you’ve a somewhat usable description of a site move handled by Yahoo, MSN, or Ask. There are only so many ways to handle such a challenge.

That’s a horror scenario, isn’t it? Well, at Google the recently changed infrastructure has greatly improved this process, and other search engines evolve too, but moves as well as significant structural changes will always result in periods of decreased SERP referrers, or even no search engine traffic at all.

Does that mean that big moves are too risky, or even not doable? Not at all. You just need deep pockets. If you lack a budget to feed the site with PPC or other bought traffic to compensate an estimated loss of organic traffic lasting at least a few weeks, but perhaps months, then don’t move. And when you move, then set up a professionally managed project, and hire experts for this task.

Here are some guidelines. I don’t provide a timeline, because that’s impossible without detailed knowledge of the individual circumstances. Adapt the procedure to fit your needs, nothing’s set in stone.

  • Set up the site on the new Web server (new.com). In robots.txt block everything except a temporary page telling that this server is the new home of your site. Link to this page to get search engines familiar with the new server, but make sure there are no links to blocked content yet.
  • Create mapping tables “old URL to new URL” (or corresponding algorithms) to prepare the 301 redirects etcetera. You could consolidate multiple pages under one redirect target and so on, but you better wait with changes like that. Do them after the move. When you keep the old site’s structure on the new server, you make the job easier for search engines.
  • If you plan to do structural changes after the move, then develop the redirects in a way that lets you easily change the redirect targets on the old site, and prepare the internal redirects on the new site as well. In any case, your redirect routines must be able to redirect or not depending on parameters like site area, user agent / requestor IP and such stuff (see the sketch after this list), and you need a flexible control panel as well as URL specific crawler auditing on both servers.
  • On old.com develop a server sided procedure which can add links to the new location on every page of your old domain. Identify your URLs with the lowest crawling frequency. Work out a timetable for the move which considers page importance (with regard to search engine traffic) and crawl frequency.
  • Remove the Disallow: statements in the new server’s robots.txt. Create one or more XML sitemap(s) for the new server and make sure that you set crawl-priority and change-frequency accurately, and that last-modified gets populated with the scheduled beginning of the move (IOW the day the first search engine crawler can access the sitemap). Feed the engines with sitemap files listing the important URLs first. Add sitemap-autodiscovery statements to robots.txt, and manually submit the sitemaps to Google and Yahoo.
  • Fire up the scripts creating visible “this page will move to [new location] soon” links on the old pages. Monitor the crawlers on the new server. Don’t worry about duplicate content issues in this phase, “move” in the anchor text is a magic word. Do nothing until the crawlers have fetched at least the first and second link level on the new server, as well as most of the important pages.
  • Briefly explain your redirect strategy in robots.txt comments on both servers. If you can, add corresponding HTML comments to the HEAD section of all pages on the old server. You will cloak for a while, and things like that can help to pass reviews by humans who might get an alert from an algo or a spam report. It’s more or less impossible to redirect human traffic in chunks, because that results in annoying surfing experiences, inconsistent database updates, and other disadvantages. Search engines aren’t cruel and understand that.
  • 301 redirect all human traffic to the new server. Serve search engines the first chunk of redirecting pages. Start with a small chunk of not more than 1,000 pages or so, and bundle related pages to preserve most of the internal links within each chunk.
  • Closely monitor the crawling and indexing process of the first chunk, and don’t release the next one before it has (nearly) finished. It’s probably necessary to handle each crawler individually.
  • Whilst you release chunk after chunk of redirects to the engines, adjusting the intervals based on your experiences, contact all sites linking to you and ask for URL updates (remember to delay these requests for inbound links pointing to URLs you’ll change after the move for other reasons). It helps when you offer an incentive; best let your marketing dept. handle this task (having a valid reason to get in touch with those Webmasters might open some opportunities).
  • Support the discovery crawling based on redirects and updated inbound links by releasing more and more XML sitemaps on the new server. Enabling sitemap based crawling should somewhat correlate with your release of redirect chunks. Both discovery crawling and submission based crawling share the bandwidth, respectively the amount of daily fetches the crawling engine has determined for your new server. Hence don’t disturb the balance by submitting sitemaps listing 200,000 unimportant 5th level URLs whilst a crawler processes a chunk of landing pages promoting your best selling products. You can steer sitemap autodiscovery depending on the user agent (for MSN and Ask which don’t offer submit forms) in your robots.txt, in combination with submissions to Google and Yahoo. Don’t forget to maintain (delete or update frequently) the sitemaps after the move.
  • Make sure you can control your redirects forever. Pay the hosting service and the registrar of the old site for the next ten years upfront. ;)
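
To illustrate the conditional redirect routine mentioned in the list (chunked release, different treatment for crawlers and humans), here’s a minimal PHP sketch. The mapping table, the chunk logic and the crawler check are made up for the example, so treat it as a starting point rather than a drop-in solution.

<?php
// Conditional 301 on old.com (sketch): humans get redirected right away,
// crawlers only when the requested URL belongs to an already released chunk.
$map = array(
  '/products/widget.html' => 'http://new.com/products/widget.html',
  '/about.html'           => 'http://new.com/about.html',
);
$released = array('/products/widget.html'); // URLs already released to crawlers

function is_major_crawler() {
  // crude UA sniffing for the example; verify IP ranges in a real setup
  $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  return (bool) preg_match('/googlebot|slurp|msnbot/i', $ua);
}

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (isset($map[$path]) && (!is_major_crawler() || in_array($path, $released))) {
  header('HTTP/1.1 301 Moved Permanently');
  header('Location: ' . $map[$path]);
  header('X-Redirect-Src: site-move.php'); // tag the redirecting script
  exit;
}
// otherwise serve the old page with the visible "this page will move to ..." link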

Of course there’s no such thing as a bullet-proof procedure to move large sites, but you can do a lot to make the move as smooth as possible.

302 - Found [Elsewhere]

The 302 redirect, like the 303/307 response codes, is kind of a soft redirect. Whilst a 301-redirect indicates a hard redirect by telling the user agent that a requested address is outdated (should be deleted) and the resource must be requested under another URL, 302 (303/307) redirects can be used with URLs which are valid, and should be kept by the requestor, but don’t deliver content at the time of the request. In theory, a 302′ing URL could redirect to another URL with each and every request, and even serve contents itself every now and then.

Whilst that’s no big deal for user agents used by humans (browsers, screen readers), search engines, which crawl and index contents by following paths that must be accessible for human surfers, consider soft redirects unreliable by design. What makes indexing soft redirects a royal PITA is the fact that most soft redirects actually are meant to notify of a permanent move. 302 is the default response code for all redirects, and setting the correct status code is not exactly popular in developer crowds, so gazillions of 302 redirects are syntax errors which mimic 301 redirects.

Search engines have no choice but to request those wrongly redirecting URLs over and over to persistently check whether the soft redirect’s functionality sticks with the implied behavior of a permanent redirect.

Also, way back when search engines interpreted soft redirects according to the HTTP standards, it was possible to hijack foreign resources with a 302 redirect and even meta refreshes. That means that a strong (high PageRank) URL 302-redirecting to a weaker (lower PageRank) URL on another server got listed on the SERPs with the contents pulled from the weak page. Since Internet marketers are smart folks, this behavior enabled creative content delivery: of course only crawlers saw the redirect, humans got a nice sales pitch.

With regard to search engines, 302 redirects should be applied very carefully, because ignorant developers and, well, questionable intentions have forced the engines to handle 302 redirects in a way that’s not exactly compliant with Web standards, but meant to be the best procedure to fit a searcher’s interests. When you do cross-domain 302s, you can’t predict whether search engines pick the source, the target, or even a completely different but nice looking URL from the target domain on their SERPs. In most cases the target URL of 302-redirects gets indexed, but according to Murphy’s law and experience of life, “99%” leaves enough room for serious messups.

Partly the common 302-confusion is based on the HTTP standard(s). With regard to SEO, response codes usable with GET and HEAD requests are more important, so I simplify things by ignoring issues with POST requests. Let’s compare the definitions:

HTTP/1.0: 302 Moved Temporarily

The requested resource resides temporarily under a different URL. Since the redirection may be altered on occasion, the client should continue to use the Request-URI for future requests.

The URL must be given by the Location field in the response. Unless it was a HEAD request, the Entity-Body of the response should contain a short note with a hyperlink to the new URI(s).

HTTP/1.1: 302 Found

The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

The temporary URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).

First, there’s a changed reason phrase for the 302 response code. “Moved Temporarily” became “Found” (”Found Elsewhere”), and a new response code 307 labelled “Temporary Redirect” was introduced (the other new response code 303 “See Other” is for POST results redirecting to a resource which requires a GET request).

Creatively interpreted, this change could indicate that we should replace 302 redirects applied to temporarily moved URLs with 307 redirects, reserving the 302 response code for hiccups and redirects done by the Web server itself –without an explicit redirect statement in the server’s configuration (httpd.conf or .htaccess)–, for example in response to requests for maliciously shortened URIs (of course a 301 is the right answer in this case, but some servers use the “wrong” 302 response code by default to err on the side of caution until the Webmaster sets up proper canonicalization redirects returning 301 response codes).

Strictly interpreted, this change tells us that the 302 response code must not be applied to moved URLs, regardless of whether the move is really a temporary replacement (during maintenance windows, to point to mirrors of pages on overcrowded servers during traffic spikes, …) or even a permanent forwarding request where somebody didn’t bother sending a status line to qualify the location directive. As for maintenance, better use 503 “Service Unavailable”!
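
For the maintenance case that boils down to a few lines of PHP; the Retry-After value below is just an example.

<?php
// Maintenance window: answer with 503 plus a Retry-After hint instead of a soft redirect.
header('HTTP/1.1 503 Service Unavailable');
header('Retry-After: 3600'); // seconds; an example value
header('Content-Type: text/html');
echo '<h1>Down for maintenance</h1><p>Please check back in an hour.</p>';
exit;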

Another important change is the addition of the cacheability clause in HTTP/1.1. Because the HTTP/1.0 standard didn’t explicitly state that the URL given in the location field must not be cached, some user agents did cache it, and the few Web developers actually reading the specs thought they were allowed to simplify their various redirects (302′ing everything), because in the eyes of a developer nothing is really there to stay (SEOs, who handle URLs as assets, often don’t understand this philosophy, and thus sadly act confrontational instead of educational).

Having said all that, is there still a valid use case for 302 redirects? Well, since 307 is an invalid response code for HTTP/1.0 requests, and crawlers still perform those, there seems to be no alternative to 302. Is that so? Not really, at least not when you’re dealing with overcautious search engine crawlers. Most HTTP/1.0 requests from search engines are faked, that means the crawler understands everything HTTP/1.1 but sends an HTTP/1.0 request header just in case the server has been running since the Internet’s stone age without any upgrades. Yahoo’s Slurp for example does faked HTTP/1.0 requests in general, whilst you can trust Ms. Googlebot’s request headers. If Google’s crawler does an HTTP/1.0 request, that’s either a test of the capabilities of a newly discovered server, or something went awfully wrong, usually on your side.

Google’s as well as Yahoo’s crawlers understand both the 302 and the 307 redirect (there’s no official statement from Yahoo though). But there are other Web robots out there (like link checkers of directories, or similar bots sent out by site owners to automatically remove invalid as well as redirecting links), some of them consisting of legacy code. Not to speak of ancient browsers in combination with Web servers which don’t add the hyperlink piece to 307 responses. So if you want to do everything the right way, you send 302 responses to HTTP/1.0 requestors –except when the user agent and the IP address identify a major search engine’s crawler–, and 307 responses to everything else –except when the HTTP/1.1 user agent lacks understanding of 307 response codes–. Ok, ok, ok … you’ll stick with the outdated 302 thingy. At least you won’t change old code just to make it more complex than necessary. With newish applications, which rely on state of the art technologies like AJAX anyway, you can quite safely assume that the user agents understand the 307 response, hence go for it and bury the wrecked 302, but submit only non-redirecting URLs to other places.
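
If you really want to play that game, a simplified PHP sketch of the protocol/UA based decision could look like the following; the crawler detection is reduced to UA sniffing here, which isn’t sufficient in real life, and the target URL is made up.

<?php
// Temporary redirect: 307 for HTTP/1.1 requestors and major crawlers, 302 for the rest.
$target   = 'http://example.com/mirror/page.html';
$protocol = isset($_SERVER['SERVER_PROTOCOL']) ? $_SERVER['SERVER_PROTOCOL'] : 'HTTP/1.0';
$ua       = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_major_crawler = (bool) preg_match('/googlebot|slurp/i', $ua); // they understand 307

if ($protocol == 'HTTP/1.1' || $is_major_crawler) {
  header($protocol . ' 307 Temporary Redirect');
} else {
  header($protocol . ' 302 Found');
}
header('Location: ' . $target);
// short hypertext note for ancient user agents
echo '<a href="' . htmlspecialchars($target) . '">' . htmlspecialchars($target) . '</a>';
exit;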

Here is how Google handles 302 redirects:

[Source …] you shouldn’t use it to tell the Googlebot that a page or site has moved because Googlebot will continue to crawl and index the original location.

Well, that’s not much info, and obviously a false statement. Actually, Google continues to crawl the redirecting URL, then indexes the source URL with the target’s content for redirects within a domain or subdomain only –but not always–, and mostly indexes the target URL and its content when a 302 redirect leaves the domain of the redirecting URL –unless another URL redirecting to the same location or serving the same content looks prettier–. In most cases Google indexes the content served by the target URL, but in some cases all URL candidates involved in a redirect lose this game in favor of another URL Google has discovered on the target server (usually a short and pithy URL).

Like with 301 redirects, Yahoo “breaks the rules” with 302 redirects too:

[Source …] When one web page redirects to another web page, Yahoo! Web Search sometimes indexes the page content under the URL of the entry or “source” page, and sometimes index it under the URL of the final, destination, or “target” page. […]

When a page in one domain redirects to a page in another domain, Yahoo! records the “target” URL. […]

When a page in a domain presents a temporary redirect to another page in the same domain, Yahoo! indexes the “source” URL.

Yahoo! Web Search indexes URLs that redirect according to the general guidelines outlined above with the exception of special cases that might be read and indexed differently. […]

One of these cases where Yahoo handles redirects “differently” (meaning according to the HTTP standards) is a soft redirect from the root index page to a deep page. Like with a 301 redirect, Yahoo indexes the home page URL with the contents served by the redirect’s target.

You see there are not that many advantages to 302 redirects pointing to other servers. Those redirects are most likely understood as somewhat permanent redirects, which means that the engines most probably crawl the redirecting URLs at a lower frequency than 307 redirects.

If you have URLs which change their contents quite frequently by redirecting to different resources (from the same domain or on another server), and you want search engines to index and rank those timely contents, then consider the hassles of IP/UA based response codes depending on the protocol version. Also, feed those URLs with as many links as you can, and list them in an XML sitemap with a high priority value, a last-modified timestamp like the request timestamp minus a few seconds, and an “always”, “hourly” or “daily” change frequency tag. Do that even if you for whatever reasons have no XML sitemap at all: an XML sitemap listing only the ever changing URLs should do the trick, and there’s no better procedure to pass such special instructions to crawlers.
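
If you go that route, such a throwaway sitemap can be generated on the fly; here’s a minimal PHP sketch, with a made-up URL.

<?php
// Tiny XML sitemap listing only the ever changing redirect URLs, with a high
// priority, an hourly change frequency and a lastmod just before "now".
$urls = array('http://example.com/hot-deals/');
header('Content-Type: application/xml');
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($urls as $url) {
  echo "  <url>\n";
  echo '    <loc>' . htmlspecialchars($url) . "</loc>\n";
  echo '    <lastmod>' . gmdate('c', time() - 30) . "</lastmod>\n";
  echo "    <changefreq>hourly</changefreq>\n";
  echo "    <priority>1.0</priority>\n";
  echo "  </url>\n";
}
echo "</urlset>\n";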

If you promote your top level page but pull the contents from deep pages or scripts, then a 302 meant as a 307 from the root to the output device is a common way to avoid duplicate content issues while serving contents depending on other request signals than the URI alone (cookies, geo targeting, referrer analysis, …). However, that’s a case where you can avoid the redirect. Duplicating one deep page’s content on root level is a non-issue, whereas a superfluous redirect is an issue with regard to performance at least, and it sometimes slows down crawling and indexing. When you output different contents depending on user specific parameters, treating crawlers as users is easy to accomplish. I’d just make the root index default document a script outputting the former redirect’s target. That’s a simple solution without redirecting anyone (which sometimes directly feeds the top level URL with PageRank from user links to their individual “home pages”).
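
That script can be as dumb as this PHP sketch (path made up), configured as the root’s default document via DirectoryIndex or your server’s equivalent.

<?php
// Root index document pulling in the former redirect target instead of redirecting.
$deep_page = dirname(__FILE__) . '/landing/default-home.php'; // made-up path
if (is_file($deep_page)) {
  require $deep_page; // its output gets served under the root URL, no redirect involved
} else {
  header('HTTP/1.1 503 Service Unavailable'); // fail loudly if the deep page is gone
}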

307 - Temporary Redirect

Well, since the 307 redirect is the 302’s official successor, I’ve told you nearly everything about it in the 302 section. Here is the HTTP/1.1 definition:

307 Temporary Redirect

The requested resource resides temporarily under a different URI. Since the redirection MAY be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

The temporary URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s), since many pre-HTTP/1.1 user agents do not understand the 307 status. Therefore, the note SHOULD contain the information necessary for a user to repeat the original request on the new URI.

The 307 redirect was introduced with HTTP/1.1, hence some user agents doing HTTP/1.0 requests do not understand it. Some! Actually, many user agents fake the protocol version in order to avoid conflicts with older Web servers. Search engines like Yahoo for example perform faked HTTP/1.0 requests in general, although their crawlers do talk HTTP/1.1. If you make use of the feedburner plugin to redirect your WordPress feeds to feedburner.com/yourfeed, or to feeds.yourdomain.com resolving to feedburner.com/yourfeed, you’ll notice that Yahoo bots do follow 307 redirects, although Yahoo’s official documentation does not even mention the 307 response code.

Google states how they handle 307 redirects as follows:

[Source …] The server is currently responding to the request with a page from a different location, but the requestor should continue to use the original location for future requests. This code is similar to a 301 in that for a GET or HEAD request, it automatically forwards the requestor to a different location, but you shouldn’t use it to tell the Googlebot that a page or site has moved because Googlebot will continue to crawl and index the original location.

Well, a summary of the HTTP standard plus a quote from the 302 page is not exactly a comprehensive help topic. However, checked against the feedburner example, Google understands 307s as well.

A 307 should be used when a particular URL for whatever reason must point to an external resource. When you for example burn your feeds, redirecting your blog software’s feed URLs with a 307 response code to “your” feed at feedburner.com or another service is the way to go. In this case it plays no role that many HTTP/1.0 user agents don’t know shit about the 307 response code, because all software dealing with RSS feeds can understand and handle HTTP/1.1 response codes, or at least can interpret the 3xx class and request the feed from the URI provided in the header’s location field. More importantly, because with a 307 redirect each revisit has to start at the redirecting URL to fetch the destination URI, you can move your burned feed to another service, or serve it yourself, whenever you choose to do so, without dealing with longtime cache issues.
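
A feed URL handled that way boils down to a few lines of PHP (sketch, the feed address is a placeholder); swap the target whenever you move or reclaim the feed.

<?php
// Blog feed URL 307'ing to the burned feed.
$burned_feed = 'http://feedburner.com/yourfeed'; // placeholder address
header('HTTP/1.1 307 Temporary Redirect');
header('Location: ' . $burned_feed);
// short hypertext note for user agents which don't understand 307
echo '<a href="' . htmlspecialchars($burned_feed) . '">' . htmlspecialchars($burned_feed) . '</a>';
exit;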

302 temporary redirects might result in user agents caching the location’s URL due to an imprecise specification in the HTTP/1.0 protocol, but that shouldn’t happen with HTTP/1.1 response codes, which in the 3xx class all clearly tell what’s cacheable and what’s not.

When your site’s logs show only a tiny amount of actual HTTP/1.0 requests (eliminate crawlers of major search engines for this report), you really should do 307 redirects instead of wrecked 302s. Of course, avoiding redirects where possible is always the better choice, and don’t apply 307 redirects to moved URLs.

Recap

Here are the bold sentences again. Hop to the sections via the table of contents.

  • Avoid redirects where you can. URLs, especially linked URLs, are assets. Often you can include other contents instead of performing a redirect to another resource. Also, there are hyperlinks.
  • Search engines process HTTP redirects (301, 302 and 307) as well as meta refreshes. If you can, always go for the cleaner server sided redirect.
  • Always redirect to the final destination to avoid useless hops which kill your search engine traffic. With each and every revamp that comes with URL changes check for incoming redirects and make sure that you eliminate unnecessary hops.
  • You must maintain your redirects forever, and you must not remove (permanent) redirects. Document all redirects, especially when you do redirects both in the server configuration as well as in scripts.
  • Check your logs for redirects done by the Web server itself and unusual 404 errors. Vicious Web services like Yahoo or MSN screw your URLs to get you in duplicate content troubles with Google.
  • Don’t track links with redirecting scripts. Avoid redirect scripts in favor of link attributes. Don’t hoard PageRank by routing outgoing links via an uncrawlable redirect script, don’t buy too much of the search engine FUD, and don’t implement crappy advice from Webmaster hangouts.
  • Clever redirects are your friend when you handle incoming and outgoing affiliate links. Smart IP/UA based URL cloaking with permanent redirects makes you independent from search engine canonicalization routines which can fail, and improves your overall search engine visibility.
  • Do not output anything before an HTTP redirect, and terminate the script after the last header statement.
  • For each server sided redirect, send an HTTP status line with a well chosen response code, and an absolute (fully qualified) URL in the location field. Consider tagging the redirecting script in the header (X-Redirect-Src); see the sketch after this list.
  • Put any redirect logic at the very top of your scripts. Encapsulate redirect routines. Performance is not everything, transparency is important when the shit hits the fan.
  • Test all your redirects with server header checkers for the right response code and a working location. If you forget an HTTP status line, you get a 302 redirect regardless of your intention.
  • With canonicalization redirects use not-equal conditions to cover everything. Most .htaccess code posted on Webmaster boards, supposed to fix for example www vs. non-www issues, is unusable. If you reply “thanks” to such a post with your URL in the signature, you invite saboteurs to make use of the exploits.
  • Use only 301 redirects to handle permanently moved URLs and canonicalization. Use 301 redirects only for persistent decisions. In other words, don’t blindly 301 everything.
  • Don’t redirect too many URLs simultaneously; move large amounts of pages in smaller chunks.
  • 99% of all 302 redirects are either syntax errors or semantically crap, but there are still some use cases for search engine friendly 302 redirects. “Moved URLs” is not on that list.
  • The 307 redirect can replace most wrecked 302 redirects, at least in current environments.
  • Search engines do not handle redirects according to the HTTP specs any more. At least not when a redirect points to an external resource.
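
To wrap up the list’s redirect hygiene in code, here’s a minimal PHP helper along those lines; the function name and the X-Redirect-Src value are made up, and you’d adapt the status phrases and error handling to your environment.

<?php
// Encapsulated redirect routine: explicit status line, absolute Location,
// an X-Redirect-Src tag, no output beforehand, hard exit afterwards.
function send_redirect($absolute_url, $code = 301, $src = 'redirect-helper.php') {
  if (headers_sent($file, $line)) {
    trigger_error("Output already started in $file:$line, redirect aborted", E_USER_ERROR);
  }
  $phrases = array(301 => 'Moved Permanently', 302 => 'Found', 307 => 'Temporary Redirect');
  header('HTTP/1.1 ' . $code . ' ' . $phrases[$code]);
  header('Location: ' . $absolute_url); // always fully qualified
  header('X-Redirect-Src: ' . $src);    // tag the redirecting script
  exit;                                 // nothing may follow the header statements
}

send_redirect('http://example.com/new-location/', 301);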

I’ve asked Google in their popular picks campaign for a comprehensive write-up on redirects (which is part of the ongoing help system revamp anyway, but I’m either greedy or not patient enough). If my question gets picked, I’ll update this post.

Did I forget anything else? If so, please submit a comment. ;)




Google and Yahoo accept undelayed meta refreshes as 301 redirects

Although the meta refresh often gets abused to trick visitors into popup hells by sneaky pages on low-life free hosts (poor man’s cloaking), search engines don’t treat every instance of the meta refresh as Webspam. Folks moving their free hosted stuff to their own domains rely on it to redirect to the new location:
<meta http-equiv=refresh content="0; url=http://example.com/newurl" />

Yahoo clearly states how they treat a zero meta refresh, that is a redirect with a delay of zero seconds:

META Refresh: <meta http-equiv=”refresh” content=…> is recognized as a 301 if it specifies little or no delay or as a 302 if it specifies noticeable delay.

Google is in the process of rewriting their documentation; in the current version of their help documents the meta refresh is not (yet!) mentioned. The Google Mini treats all meta refreshes as 302:

A META tag that specifies http-equiv=”refresh” is handled as a 302 redirect.

but that’s handled differently on the Web. I’ve asked Google’s search evangelist Adam Lasnik and he said:

[The] best idea is to use 301/302s directly whenever possible; otherwise, next best is to do a metarefresh with 0 for a 301. I don’t believe we recommend or support any 302-alternative.

Thanks Adam! I’ll update the last meta refresh thread.

If you have the chance to do 301 redirects, don’t mess with the meta refresh. Utilize this method only when there’s absolutely no other option.

Full stop for search geeks. What follows is an explanation for less experienced Webmasters who need to move their stuff away from greedy Web content funeral services, aka free hosts of any sort.

Ok, now that we know the major search engines accept an undelayed meta refresh as a poor man’s 301 redirect, what should a page carrying this tag look like in order to act as a provisional permanent redirect? As plain and functional as possible:
<html>
<head>
<title>Moved to new URL: http://example.com/newurl</title>
<meta http-equiv=refresh content="0; url=http://example.com/newurl" />
<meta name="robots" content="noindex,follow" />
</head>
<body>
<h1>This page has been moved to http://example.com/newurl</h1>
<p>If your browser doesn't redirect you to the new location please <a href="http://example.com/newurl"><b>click here</b></a>, sorry for the hassles!</p>
</body>
</html>

As long as the server delivers the content above under the old URL sending a 200-OK, Google’s crawl stats should not list the URL under 404 errors. If it does appear under “Not found”, something went awfully wrong, probably on the free host’s side. As long as you’ve control over the account, you must not delete the page, because the search engines revisit it from time to time to check whether you still redirect with that URL or not.

[Excursus: When a search engine crawler fetches this page, the server returns a 200-OK because, well, it’s there. Acting as a 301/302 does not make it a standard redirect. That sounds confusing to some people, so here is the technical explanation. Server sided response codes like 200, 302, 301, 404 or 410 are sent by the Web server to the user agent in the HTTP header before the server delivers any page content to the user agent (Web browser, search engine crawler, …). The meta refresh OTOH is a client sided directive telling the user agent to disregard the page’s content and to fetch the given (new) URL to render it instead of the initially requested URL. The browser parses the redirect directive out of the file which was received with a HTTP response code 200 (OK). That’s why you don’t get a 302 or 301 when you use a server header checker.]
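
You can watch that difference yourself with a few lines of PHP (URLs made up): a meta refresh page answers with a 200 status line, a server sided redirect with a 3xx one.

<?php
// Poor man's server header checker: print the first status line of each response.
$urls = array(
  'http://example.com/meta-refresh-page.html',    // expect HTTP/1.x 200 OK
  'http://example.com/301-redirecting-page.html', // expect HTTP/1.x 301 Moved Permanently
);
foreach ($urls as $url) {
  $headers = @get_headers($url); // returns the raw header lines, redirect chain included
  echo $url . ' => ' . ($headers ? $headers[0] : 'request failed') . "\n";
}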

When a search engine crawler fetches the page above, that’s just the beginning of a pretty complex process. Search engines are large-scale systems which make use of asynchronous communication between tons of highly specialized programs. The crawler itself has nothing to do with indexing. Maybe it follows server sided redirects instantly, but that’s unlikely with meta refreshes because crawlers just fetch Web contents for unprocessed delivery to a data pool from where all sorts of processes like (vertical) indexers pull their fodder. Deleting a redirecting page in the search index might be done by process A running hourly, whilst process B instructing the crawler to fetch the redirect’s destination runs once a day, then the crawler may be swamped so that it delivers the new content a month later to process C which ran just five minutes before the content delivery and starts again not before next Monday if that’s not a bank holiday…

That means the old page may get deindexed way before the new URL makes it into the search index. If you change anything during this period, you just confuse the pretty complex chain of processes, which means that perhaps the search engine starts over by rolling back all transactions and refetching the redirecting page. Not good. Keep all kinds of permanent redirects forever.

Actually, a zero meta refresh works like a 301 redirect because the engines (shall) treat it as a permanent redirect, but it’s not a native 301. In fact, due to so much abuse by spammers it might be considered less reliable than a server sided 301 sent in the HTTP header. Hence you want to express your intention clearly to the engines. You do that with several elements of the meta refresh’ing page:

  • The page title says that the resource was moved and tells the new location. Words like “moved” and “new URL” without surrounding gimmicks make the message clear.
  • The zero (second) delay parameter shows that you don’t deliver visible content to (most) human visitors but switch their user agent right to the new URL.
  • The “noindex” robots meta tag telling the engines not to index the actual page’s contents is a signal that you don’t cheat. The “follow” value (referring to links in BODY) is just a fallback mechanism to ensure that engines having trouble understanding the redirect at least follow and index the “click here” link.
  • The lack of indexable content and keywords makes clear that you don’t try to achieve SE rankings for anything except the new URL.
  • The H1 heading repeating the title tag’s content on the page, visible for users surfing with meta refresh = off, reinforces the message and helps the engines to figure out the seriousness of your intent.
  • The same goes for the text message with its clear call for action, underlining the URL introduced by the other elements.

Meta refreshes, like other client sided redirects (e.g. window.location = "http://example.com/newurl"; in JavaScript), can be found in every spammer’s toolbox, so don’t leave the outdated content on the page, and add a JavaScript redirect only to contentless pages like the sample above. Actually, you don’t need to do that, because the number of users surfing with meta-refresh=off is only a tiny fraction of your visitors, and using JavaScript redirects is way more risky (WRT picky search engines) than a zero meta refresh. Also, JavaScript redirects –if captured by a search engine– should count as 302, and you really don’t want to deal with all the disadvantages of soft redirects.

Another interesting question is whether removing the content from the outdated page makes a difference or not. Doing a mass search+replace to insert the meta tags (refresh and robots) with no further changes to the HTML source might seem attractive from a Webmaster’s perspective. It’s fault-prone, however. Creating a list mapping outdated pages to their new locations to feed a quick+dirty desktop program generating the simple HTML code above is actually easier and eliminates a couple of points of failure.
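
Such a generator is a ten minute job; here’s a PHP sketch working from a hard-coded mapping (file names and URLs are made up) that writes stubs like the sample above.

<?php
// Generate one plain meta-refresh stub per outdated page from an "old file -> new URL" map.
$map = array(
  'old-about.html' => 'http://example.com/about/',
  'old-news.html'  => 'http://example.com/news/',
);
$template = '<html>
<head>
<title>Moved to new URL: %1$s</title>
<meta http-equiv=refresh content="0; url=%1$s" />
<meta name="robots" content="noindex,follow" />
</head>
<body>
<h1>This page has been moved to %1$s</h1>
<p>If your browser doesn\'t redirect you to the new location please <a href="%1$s"><b>click here</b></a>, sorry for the hassles!</p>
</body>
</html>';
foreach ($map as $old_file => $new_url) {
  file_put_contents($old_file, sprintf($template, htmlspecialchars($new_url)));
}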

Finally: Make use of meta refreshes on free hosts only. Professional hosting firms let you do server sided redirects!




LZZR Linking™

In why it is a good thing to link out loud, LZZR explains a nicely designed method to accelerate the power of inbound links. Unfortunately this technique involves Yahoo! Pipes, which is evil. Certainly that’s a nice tool to compose feeds, but Yahoo! Pipes automatically inserts the evil nofollow crap. Hence using Pipes’ feed output to amplify links fails because of the auto-nofollow. I’m sure LZZR can replace this component with ease, if that’s not done already.




Yahoo! search going to torture Webmasters

According to Danny, Yahoo! supports a multi-class nonsense called the robots-nocontent tag. CRAP ALERT!

Can you senseless and cruel folks at Yahoo!-search imagine how many of my clients who’d like to use that feature have copied and pasted their pages? Do you’ve a clue how many sites out there don’t make use of SSI, PHP or ASP includes, and how many sites never heard of dynamic content delivery, respectively how many sites can’t use proper content delivery techniques because they’ve to deal with legacy systems and ancient business processes? Did you ask how common templated Web design is, and I mean the weird static variant, where a new page gets built from a randomly selected source page saved as new-page.html?

It’s great that you came out with a bastardized copy of Google’s somewhat hapless (in the sense of cluttering structured code) section targeting, because we dreadfully need that functionality across all engines. And I admit that your approach is a little better than AdSense section targeting because you don’t mark up payload and paydirt with comments. But why the heck did you design it that crappily? The unthoughtful draft of a microformat from which you’ve “stolen” that unfortunate idea didn’t become a standard for very good reasons. Because it’s crap. Assigning multiple class names to markup elements for the sole purpose of setting crawler directives is as crappy as inline style assignments.

Well, due to my zero-bullshit tolerance I’m somewhat upset, so I repeat: Yahoo’s robots-nocontent class name is crap by design. Don’t use it, boycott it, because if you make use of it you’ll change gazillions of files for each and every proprietary syntax supported by a single search engine in the future. If the united search geeks can agree on flawed standards like rel-nofollow, they should be able to talk about a sensible evolvement of robots.txt.

There’s a way easier solution, which doesn’t require editing tons of source files: standardizing CSS-like syntax to assign crawler directives to existing classes and DOM-IDs. For example, extend robots.txt syntax like:

A.advertising { rel: nofollow; } /* devalue aff links */

DIV.hMenu, TD#bNav { content:noindex; rel:nofollow; } /* make site wide links unsearchable */

Unsupported robots.txt syntax doesn’t harm, proprietary attempts do harm!
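
Just to show that such syntax wouldn’t be hard to consume, here’s a purely illustrative PHP sketch parsing the proposed rules into a selector-to-directives map; remember that this syntax is my proposal, no engine supports it.

<?php
// Parse CSS-like robots.txt rules of the form "SELECTOR, SELECTOR { property: value; }".
$robots_txt = 'A.advertising { rel: nofollow; }
DIV.hMenu, TD#bNav { content: noindex; rel: nofollow; }';
$rules = array();
if (preg_match_all('/([^{}]+)\{([^}]*)\}/', $robots_txt, $matches, PREG_SET_ORDER)) {
  foreach ($matches as $m) {
    $directives = array();
    foreach (array_filter(array_map('trim', explode(';', $m[2]))) as $declaration) {
      list($property, $value) = array_map('trim', explode(':', $declaration, 2));
      $directives[$property] = $value;
    }
    foreach (array_map('trim', explode(',', $m[1])) as $selector) {
      $rules[$selector] = $directives; // e.g. 'TD#bNav' => array('content' => 'noindex', 'rel' => 'nofollow')
    }
  }
}
print_r($rules);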

Dear search engines, get together and define something useful, before each of you comes out with different half-baked workarounds like section targeting or robots-nocontent class values. Thanks!




How Google & Yahoo handle the link condom

Loren Baker over at SEJ got a few official statements on use and abuse of the rel-nofollow microformat by the major players: How Google, Yahoo & Ask treat NoFollow’ed links. Great job, thanks!

Ask doesn’t “officially” support nofollow, whatever that means. Loren didn’t ask MSN, probably because he didn’t expect that they’ve even noticed that they officially support nofollow since 2005; same procedure with sitemaps, by the way. Yahoo implemented it along the lines of the specs, and Google stepped way over the line the norm sets. So here is the difference:

1. Do you follow a nofollow’ed link?
Google: No (longer)
Yahoo: Yes

2. Do you index the linked page following a nofollow’ed link?
Google: Obsolete, see 1.
Yahoo: Yes

3. Does your ranking algos factor in reputation, anchor/alt/title text or whichever link love sourced from a nofollow’ed link?
Google: Obsolete, see 1.
Yahoo: No

4. Do you show nofollow’ed links in reverse citation results?
Google: Yes (in link: searches by accident, in Webmaster Central if the source page didn’t make it into the supplemental index)
Yahoo: Yes (Site Explorer)

Q&A#4 is made up but accurate. I think it’s safe to assume that MSN handles the link condom like Yahoo. (Update: As Loren clarifies in the comments, he asked MSN search but they didn’t answer in a timely fashion.)

And here’s a remarkable statement from Google’s search evangelist Adam Lasnik, who may like nofollow or not:

On a related note, though, and echoing Matt’s earlier sentiments … we hope and expect that more and more sites — including Wikipedia — will adopt a less-absolute approach to no-follow … expiring no-follows, not applying no-follows to trusted contributors, and so on.

Bravo!

Related link: rel=”nofollow” Google, Yahoo and MSN




Yahoo Pipes jeopardizes the integrity of the Internet

Update: This post, initially titled “No more nofollow-insane at Google Reader”, then updated as “(No) more nofollow-insane at Google Reader”, accused Google Reader of inserting nofollow crap. I apologize for my lazy and faulty bug report. Read the comments.

I fell in love with Yahoo Pipes because that tool allowed me to funnel the tidbits contained in a shitload of noise into a more or less clear signal. Instead of checking hundreds of blog feeds, search query feeds and whatever else, I was able to feed my preferred reader with actual payload extracted from vast loads of paydirt dug from lots of sources.

Now that I’ve learned that Yahoo Pipes is evil, I guess I must code the filters myself. Nofollow insanity is not acceptable. Nofollow madness jeopardizes the integrity of the Internet, which is based on free linkage. I don’t need no stinking link condoms sneakily forced on me by nice looking tools utilizing nifty round corners. I’ll be way happier with a crappy and uncomfortable PHP hack fed with OPML files and conditions pulled from a manually edited MySQL table.

Here is the evidence right from the Yahoo pipe output:
Also, abusing my links with target=”_blank” is not nice.


Initial post and its first update:

I’m glad Google has removed the auto-nofollow on links in blog posts. When I add a feed I trust its linkage and I don’t need no stinking condoms on pages nobody except me can see unless I share them. Thanks!

Update - Nick Baum, can you hear me?

It seems the nofollow-madness is not yet completely buried. Here is a post of mine and what Google Reader shows me when I add my blog’s feed:
[screenshot]
And here is the same post filtered thru a Yahoo pipe:
[screenshot]
So please tell me: why does Google auto-nofollow a link to Vanessa Fox when she gets linked via Yahoo, and uncondomizes the link from Google’s very own blogspot dot com? Curious …




Dear search engines, please bury the rel=nofollow-fiasco

The misuse of the rel=nofollow initiative is getting out of control. Invented to fight comment spam, nowadays it is applied to commercial links, biased editorial links, navigational links, links to worst enemies (funny example: Matt Cutts links to a SEO-Blackhat with rel=nofollow) and whatever else. Gazillions of publishers and site owners add it to their links for the wrong reasons, simply because they don’t understand its intention, its mechanism, and especially not the ongoing morphing of its semantics. Even professional Webmasters and search engine experts have a hard time following the nofollow-beast semantically. The more its initial usage gets diluted, the more folks suspect search engines cook their secret sauce with indigestible nofollow-ingredients.

Not only was rel=nofollow unable to stop blog-spam-bots, it came with a built-in flaw: confusion.

Good news is that currently the nofollow-debate gets stoked again. Threadwatch hosts a thread titled Nofollow’s Historical Changes and Associated Hypocrisy, folks are ranting on the questionable Wikipedia decision to nofollow all outbound links, Google video folks manipulated the PageRank algo by plastering most of their links with rel=nofollow by mistake, and even Yahoo’s top gun Jeremy Zawodny is not that happy with the nofollow-debacle for a while now.

I say that it is possible to replace the unsuccessful nofollow-mechanism with an understandable and reasonable functionality to allow search engine crawler directives on link level. It can be done although there are shitloads of rel=nofollow links out there. Here is why, and how:

The value “nofollow” in the link’s REL attribute creates misunderstandings, recently even in the inventor’s company, because it is, hmmm, hapless.

In fact, back then it meant “passnoreputation” and nothing more. That is, search engines shall follow those links, and they shall index the destination page, and they shall show those links in reversed citation results. They just must not pass any reputation or topical relevancy with that link.

There were microformats better suited to achieve the goal, for example Technorati’s votelinks, but unfortunately the united search geeks have chosen a value adapted from the robots exclusion standard, which is plain misleading because it has absolutely nothing to do with its (intended) core functionality.

I can think of cases where a real nofollow-directive for spiders on link level makes perfect sense. It could tell the spider not to fetch a particular link destination, even if the page’s robots tag says “follow”, for example printer friendly pages. I’d use an “ignore this link” directive for example in crawlable horizontal popup menus to avoid theme dilution when every page of a section (or site) links to every other page. Actually, there is more need for spider directives on HTML element level, not only in links, for example to tag templated and/or navigational page areas like with Google’s section targeting.

There is nothing wrong with a mechanism to neutralize links in user input. Just the value “nofollow” in the type-of-forward-relationship attribute is not suitable to label unchecked or not (yet) trusted links. If it is really necessary to adopt a well known value from the robots exclusion standard (and don’t misunderstand me, reusing familiar terms in the right context is a good idea in general), the “noindex” value would have been a better choice (although not perfect). “Noindex” describes way better what happens in a SE ranking algo: it doesn’t index (in its technical meaning) a vote for the target. Period.

It is not too late to replace the rel=nofollow-fiasco with a better solution which could take care of some similar use cases too. Folks at Technorati, the W3C and wherever have done the initial work already, so there’s just a tiny task left: extending an existing norm to enable a reasonable granularity of crawler directives on link level, or better for HTML elements at all. Rel=nofollow would get deprecated, replaced by suitable and standardized values, and for a couple of years the engines could interpret rel=nofollow in its primordial meaning.

Since the rel=nofollow thingy exists, it has confused gazillions of non-geeky site owners, publishers and editors on the net. Last year I got a new client who had added rel=nofollow to all his internal links because he saw nofollowed links on a popular and well ranked site in his industry and thought rel=nofollow could perhaps improve his own rankings. That’s just one example of many where I’ve seen intentional as well as mistaken misuse of the way too geeky nofollow-value. As Jill Whalen points out to Matt Cutts, that’s just the beginning of net-wide nofollow-insanity.

Ok, we’ve learned that the “nofollow” value is a notional monster, so can we please have it removed from the search engine algos in favour of a well thought out solution, preferably asap? Thanks.





Yahoo! Site Explorer Finally Launched

Finally the Yahoo! Site Explorer (BETA) got launched. It’s a nice tool showing a site owner and the competitors all indexed pages per domain, and it offers subdomain filters. Inbound links get counted per page and per site. The tool provides links to the standard submit forms. Yahoo! accepts mass submissions of plain URL lists here.

The number of inbound links seems to be way more accurate than the guesses available from linkdomain: and link: searches. Unfortunately there is no simple way to exclude internal links. So if one wants to check only 3rd party inbounds, a painful procedure begins (a scripted version is sketched after the list):
1. Export of each result page to TSV files, that’s a tab delimited format, readable by Excel and other applications.
2. The export goes per SERP with a maximum of 50 URLs, so one must delete the two header lines per file and append file by file to produce one sheet.
3. Sorting the work sheet by the second column gives a list ordered by URL.
4. Deleting all URLs from the own site gives the list of 3rd party inbounds.
5. Wait for the bugfix “exported data of all result pages are equal” (each exported data set contains the first 50 results, regardless of which result page one clicks the export link).
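
Steps 1 to 4 can be scripted, by the way; here’s a rough PHP sketch (file names and the URL column index are assumptions, check them against your own exports).

<?php
// Merge the exported TSV files, drop the header lines, sort by URL and
// throw away the internal URLs of your own domain.
$own_domain = 'example.com';
$rows = array();
foreach (glob('siteexplorer-export-*.tsv') as $file) {
  $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  foreach (array_slice($lines, 2) as $line) {   // skip the two header lines per file
    $cols = explode("\t", $line);
    $url  = isset($cols[1]) ? $cols[1] : '';    // assuming the URL sits in the second column
    if ($url !== '' && strpos($url, $own_domain) === false) {
      $rows[$url] = $line;                      // keyed by URL, which also deduplicates
    }
  }
}
ksort($rows);
file_put_contents('external-inbounds.tsv', implode("\n", $rows) . "\n");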

The result pages provide assorted lists of all URLs known to Yahoo. The ordering does not represent the site’s logical structure (defined by linkage), not even the physical structure seems to be part of the sort order (that’s not exactly what I would call a “comprehensive site map”). It looks like the first results are ordered by popularity, followed by a more or less unordered list. The URL listings contain fully indexed pages, with known but not (yet) indexed URLs mixed in (e.g. pages with a robots “noindex” meta tag). The latter can be identified by the missing cached link.

Desired improvements:
1. A filter “with/without internal links”.
2. An export function outputting the data of all result pages to one single file.
3. A filter “with/without” known but not indexed URLs.
4. Optional structural ordering on the result pages.
5. Operators like filetype: and -site:domain.com.
6. Removal of the 1,000 results limit.
7. Revisiting of submitted URL lists a la Google sitemaps.

Overall, the site explorer is a great tool and an appreciated improvement, despite the wish list above. The most interesting part of the new toy is its API, which allows querying for up to 1,000 results (page data or link data) in batches of 50 to 100 results, returned in a simple XML format (max. 5,000 queries per IP address per day).





Yahoo’s Site Explorer

There is a lot of interesting reading in the SES coverage by Search Engine Roundtable. Currently I’m a little sitemap addicted, thus Tim Mayer’s announcement got my attention:

Tim announces a new product named Site Explorer, where you can get your linkage data. It is a place for people to go to see which pages Yahoo indexed and to let Yahoo know about URLs Yahoo has not found as of yet …. He showed an example, you basically type in a URL into it (this is also supported via an API…), then you hit explore URL and it spits out the number of pages found in Yahoo’s index and also shows you the number of inbound links. You can sort pages by “depth” (how deep pages are buried) and you can also submit URLs here. You can also quickly export the results to TSV format.

Sounds like a pretty comfortable tool to do manual submissions, harvest data for link development etc. etc. Unfortunately it’s not yet live, and I’d love to read more about the API. The concept outlined above makes me think that I may get an opportunity to shove my fresh content into Yahoo’s index way faster than today, because in comparison to other crawlers Yahoo! Slurp is a little lethargic:

Crawler stats (tiny site):

Crawler        Page fetches   robots.txt fetches   Data fetched   Last access
Googlebot      7755           30                   73.34 MB       11 Aug 2005 - 00:03
MSNBot         1627           98                   39.86 MB       10 Aug 2005 - 23:38
Yahoo! Slurp   385            204                  13.61 MB       10 Aug 2005 - 23:53

I may be misled here, but Yahoo’s Site Explorer announcement could indicate that Yahoo will not implement Google’s Sitemap Protocol. That would be a shame.

Tim Mayer in another SES session:
Q: “Is there a way to do the Google sitemaps type system at Yahoo?”
Tim: We just launched the feed to be able to do that. We will be expanding the products into the future.




