Archived posts from the 'Google' Category

Act out your sophisticated affiliate link paranoia

[Image: GOOD: paranoid affiliate link]

My recent posts on managing affiliate links and nofollow cloaking paid links led to so many reactions from my readers that I thought explaining the possible protection levels could make sense. Google’s request to condomize affiliate links is a bit, well, thin when it comes to technical tips and tricks:

Links purchased for advertising should be designated as such. This can be done in several ways, such as:
* Adding a rel=”nofollow” attribute to the <a> tag
* Redirecting the links to an intermediate page that is blocked from search engines with a robots.txt file

Also, Google doesn’t define paid links that clearly, so try this paid link definition instead before you read on. Here is my linking guide for the paranoid affiliate marketer.

Google recommends hiding any content provided by affiliate programs from its crawlers. That means not only links and banner ads, so think about tactics to hide content pulled from a merchant’s data feed too. Linked graphics, text links, testimonials and whatnot copied from an affiliate program’s sales tools page count as duplicate content (snippets) in the worst case.

Pasting code copied from a merchant’s site into a page’s or template’s HTML is not exactly a smart way to place ads. Those ads are neither manageable nor trackable, and when anything must be changed, editing tons of files is a royal PITA. Even when you’re just running a few ads on your blog, a simple ad management script allows flexible administration of your adverts.

There are tons of such scripts out there, so I won’t post a complete solution, just the code which saves your ass when a search engine that hates your ads and paid links comes by. To keep it simple and stupid, my code snippets are mostly taken from this blog, so if you run a WordPress blog you can adapt them with ease.

Cover your ass with a linking policy

Googlers as well as hired guns do review Web sites for violations of Google’s guidelines, and competitors might be in the mood to turn you in with a spam report or paid links report. A (prominently linked) full disclosure of your linking attitude can help you pass a human review by search engine staff. By the way, having a policy for dofollowed blog comments is also a good idea.

Since crawler directives like link condoms are for search engines (only), and those pay attention to your source code and to hints addressing search engines like robots.txt, you should leave a note there too; look into the source of this page for an example.
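Such a note could look like the following HTML comment; the wording and the policy URL below are illustrative, not copied from this page’s source:

```html
<!-- Note to search engine staff (illustrative wording):
     Paid and affiliate links on this site are machine-readable disclosed.
     They carry rel="nofollow" on crawler requests and/or point to a
     redirect script that is disallowed in /robots.txt.
     Human-readable disclosure: /linking-policy (hypothetical URL). -->
```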

Block crawlers from your propaganda scripts

Put all your stuff related to advertising (scripts, images, movies…) in a subdirectory and disallow search engine crawling in your /robots.txt file:
User-agent: *
Disallow: /propaganda/

Of course you’ll use an innocuous name like “gnisitrevda” for this folder, which lacks a default document and can’t be browsed because you’ve an
Options -Indexes

statement in your .htaccess file. (Watch out, Google knows what “gnisitrevda” means, so be creative or cryptic.)

Crawlers sent out by major search engines do respect robots.txt, hence it’s guaranteed that regular spiders won’t fetch it. As long as you don’t cheat too much, you won’t be haunted by those legendary anti-webspam bots sneakily accessing your site via AOL proxies or Level3 IPs. A robots.txt block doesn’t protect you from surfing search engine staff, but I won’t tell you things you’d better hide from Matt’s gang.

Detect search engine crawlers

Basically there are three common methods to detect requests by search engine crawlers.

  1. Testing the user agent name (HTTP_USER_AGENT) for strings like “Googlebot”, “Slurp” or “MSNbot” which identify crawlers. That’s easy to spoof; for example, PrefBar for Firefox lets you choose from a list of user agents.
  2. Checking the user agent name, and only when it indicates a crawler, verifying the requestor’s IP address with a reverse lookup, respectively against a cache of verified crawler IP addresses and host names.
  3. Maintaining a list of all search engine crawler IP addresses known to man, checking the requestor’s IP (REMOTE_ADDR) against this list. (That alone isn’t bullet-proof, but I’m not going to write a tutorial on industrial-strength cloaking IP delivery, I leave that to the real experts.)

For our purposes we use methods 1) and 2). When it comes to outputting ads or other paid links, checking the user agent is safe enough. Also, this allows your business partners to evaluate your linkage using a crawler’s user agent name. Some affiliate programs won’t activate your account without testing your links. When crawlers try to follow affiliate links, on the other hand, you need to verify their IP addresses for two reasons. First, you should still be able to upsell to users who spoof a crawler UA. Second, if you allow crawlers to follow your affiliate links, this may have an impact on the merchants’ search engine rankings, and that’s evil in Google’s eyes.

We use two PHP functions to detect search engine crawlers. checkCrawlerUA() returns TRUE and sets an expected crawler host name if the user agent name identifies a major search engine’s spider, or FALSE otherwise. checkCrawlerIP($string) verifies the requestor’s IP address and returns TRUE if the user agent is indeed a crawler, or FALSE otherwise. checkCrawlerIP() does primitive caching in a flat file, so that once a crawler was verified on its very first content request, it can be detected from this cache to avoid pretty slow DNS lookups. The input parameter is any string which will make it into the log file. checkCrawlerIP() does not verify an IP address if the user agent string doesn’t match a crawler name.
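For illustration, here’s a minimal sketch of how these two functions might look. The function names are taken from the description above; the UA patterns, crawler host suffixes and cache file location are my assumptions, so adapt them to your setup:

```php
<?php
// Sketch of the two detection helpers described above. The UA needles,
// host suffixes and cache path are illustrative assumptions.
$GLOBALS['crawlerHostMap'] = array(
    'Googlebot' => '.googlebot.com',
    'Slurp'     => '.crawl.yahoo.net',
    'msnbot'    => '.search.msn.com',
);

// Returns TRUE and sets $GLOBALS['expectedCrawlerHost'] when the UA
// looks like a major search engine's spider.
function checkCrawlerUA() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    foreach ($GLOBALS['crawlerHostMap'] as $needle => $hostSuffix) {
        if (stripos($ua, $needle) !== false) {
            $GLOBALS['expectedCrawlerHost'] = $hostSuffix;
            return true;
        }
    }
    return false;
}

// Verifies the requestor with a reverse lookup plus forward confirmation,
// caching verified IPs in a flat file so the slow DNS lookups run only on
// a crawler's very first request.
function checkCrawlerIP($logNote, $cacheFile = '/tmp/crawler-ip-cache.txt') {
    if (!checkCrawlerUA()) return false; // only crawler-like UAs get verified
    $ip = $_SERVER['REMOTE_ADDR'];
    $cache = is_file($cacheFile) ? file($cacheFile, FILE_IGNORE_NEW_LINES) : array();
    if (in_array($ip, $cache)) return true;
    $host = gethostbyaddr($ip); // reverse lookup
    $suffix = $GLOBALS['expectedCrawlerHost'];
    if (substr($host, -strlen($suffix)) !== $suffix) return false;
    if (gethostbyname($host) !== $ip) return false; // forward-confirm the host
    file_put_contents($cacheFile, "$ip\n", FILE_APPEND); // primitive cache
    error_log("verified crawler $ip for: $logNote");
    return true;
}
```

The forward-confirming lookup (gethostbyname() on the result of gethostbyaddr()) is what defeats spoofed reverse DNS entries; the flat file keeps the lookups down to one per crawler IP.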


Grab and implement the PHP source, then you can code statements like
$relAttribute = "";
$isSpider = checkCrawlerUA ();
if ($isSpider) {
    $relAttribute = " rel=\"nofollow\"";
}
$affLink = "<a href=\"$affUrl\"$relAttribute>call for action</a>";

$isSpider = checkCrawlerIP ($sponsorUrl);
if ($isSpider) {
    // don't redirect to the sponsor, return a 403 or 410 instead
}

More on that later.

Don’t deliver your advertising to search engine crawlers

It’s possible to serve totally clean pages to crawlers, that is without any advertising, not even JavaScript ads like AdSense’s script calls. Whether you go that far or not depends on the grade of your paranoia. Suppressing ads on a (thin|sheer) affiliate site can make sense. Bear in mind that hiding all promotional links and related content can’t guarantee indexing, because Google doesn’t index shitloads of templated pages which hide duplicate content as well as ads from crawling without carrying a single piece of somewhat compelling content.

Here is how you could output a totally uncrawlable banner ad:
$isSpider = checkCrawlerIP ($PHP_SELF);
print "<div class=\"css-class-sidebar robots-nocontent\">";
// output RSS buttons or so
if (!$isSpider) {
    print "<script type=\"text/javascript\" src=\" adName=seobook&adServed=banner\"></script>";
}
print "</div>\n";

Let’s look at the code above. First we detect crawlers “without doubt” (well, in some rare cases it can still happen that a suspected Yahoo crawler resolves to another IP owned by Yahoo, Inktomi, Altavista or AllTheWeb/FAST rather than the expected crawler host, and I’ve seen similar reports of such misbehavior for other engines too, but that might have been employees surfing with a crawler UA).

Currently the robots-nocontent class name in the DIV is not supported by Google, MSN and Ask, but it tells Yahoo that everything in this DIV shall not be used for ranking purposes. That doesn’t conflict with class names used in your CSS, because each X/HTML element can have an unlimited list of space-delimited class names. Like Google’s section targeting, that’s a crappy crawler directive, though. However, it doesn’t hurt to make use of this Yahoo feature with all sorts of screen real estate that is not relevant for search engine ranking algos, for example RSS links (use autodiscovery and pings to submit), “buy now”/”view basket” links or references to TOS pages and the like, templated text like terms of delivery (but not the street address provided for local search) … and of course ads.

Ads aren’t output when a crawler requests a page. Of course that’s cloaking, but unless the united search engine geeks come out with a standardized procedure to handle code and contents which aren’t relevant for indexing, that’s not deceitful cloaking in my opinion. Interestingly, in many cases cloaking is the last weapon in a webmaster’s arsenal that s/he can fire up to comply with search engine rules when everything else fails, because the crawlers behave more and more like browsers.

Delivering user-specific contents is in general fine with the engines; for example geo targeting, profile/logout links, or buddy lists shown to registered users only and stuff like that aren’t penalized. Since Web robots can’t pull out the plastic, there’s no reason to serve them ads just to waste bandwidth. In some cases search engines even require cloaking, for example to prevent their crawlers from fetching URLs with tracking variables and unavoidable duplicate content. (Example from Google: “Allow search bots to crawl your sites without session IDs or arguments that track their path through the site” is a call for search engine friendly URL cloaking.)
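That call for search engine friendly URL cloaking takes only a few lines. Here’s a sketch; the tracking parameter names below are assumptions, so adapt them to whatever your CMS appends:

```php
<?php
// Sketch: strip session/tracking parameters from a URL before a crawler
// indexes it. The parameter names are illustrative assumptions; the sketch
// ignores port and fragment for brevity.
function canonicalUrlForCrawler($url, $trackingParams = array('PHPSESSID', 'sid', 'ref')) {
    $parts = parse_url($url);
    $query = array();
    if (isset($parts['query'])) parse_str($parts['query'], $query);
    foreach ($trackingParams as $p) unset($query[$p]);
    $clean = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['path']) ? $parts['path'] : '/');
    if ($query) $clean .= '?' . http_build_query($query);
    return $clean;
}

// On a verified crawler request you'd 301 to the canonical URL, so the
// engine never indexes session-ID variants (checkCrawlerIP() as above):
// if (checkCrawlerIP($requestUrl) && $requestUrl != canonicalUrlForCrawler($requestUrl)) {
//     @header('HTTP/1.1 301 Moved Permanently');
//     @header('Location: ' . canonicalUrlForCrawler($requestUrl));
//     exit;
// }
```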

Is hiding ads from crawlers “safe with Google” or not?

[Image: BAD: uncloaked affiliate link]

Cloaking ads away is a double-edged sword from a search engine’s perspective. Interpreted way too strictly, it’s against the cloaking rule which states “don’t show crawlers other content than humans”, and search engines like to be aware of advertising in order to rank estimated user experiences algorithmically. On the other hand they provide us with mechanisms (Google’s section targeting or Yahoo’s robots-nocontent class name) to disable such page areas for ranking purposes, and they code their own ads in a way that crawlers don’t count them as on-the-page contents.

Although Google says that AdSense text link ads are content too, they ignore their textual contents in ranking algos. Actually, their crawlers and indexers don’t render them; they just notice the number of script calls and their placement (at least if above the fold) to identify MFA pages. In general, they ignore ads as well as other content output with client-side scripts or hybrid technologies like AJAX, at least when it comes to rankings.

Since in theory the contents of JavaScript ads aren’t considered food for rankings, cloaking them completely away (suppressing the JS code when a crawler fetches the page) can’t be wrong. Of course these script calls as well as on-page JS code are ranking factors. Google possibly counts ads, maybe even calculates ratios like screen size used for advertising vs. space used for content presentation to determine whether a particular page provides a good surfing experience for their users or not, but they can’t argue seriously that hiding such tiny signals –which they use for the sole purpose of possible downranks– is against their guidelines.

For ages search engine reps used to encourage webmasters to obfuscate all sorts of stuff they want to hide from crawlers, like commercial links or redundant snippets, by linking/outputting with JavaScript instead of crawlable X/HTML code. Just because their crawlers evolve doesn’t mean that they can take back this advice. All this JS stuff is out there, on gazillions of sites, often on pages which will never be edited again.

Dear search engines, if it does not count, then you cannot demand to keep it crawlable. Well, a few super mega white hat trolls might disagree, and depending on the implementation on individual sites maybe hiding ads isn’t totally riskless in any case, so decide yourself. I just cloak machine-readable disclosures because crawler directives are not for humans, but don’t try to hide the fact that I run ads on this blog.

Usually I don’t argue fair vs. unfair, because we’re talking war business here, which means anything goes. However, Google does everything to talk the whole Internet into disclosing ads with link condoms of any kind, and they take a lot of flak for such campaigns, hence I doubt they would cry foul today when webmasters hide both client-side as well as server-side delivery of advertising from their crawlers. Penalizing for delivery of sheer contents would be unfair. ;) (Of course that’s stuff for a great debate. If Google decides that hiding ads from spiders is evil, they will react and won’t care about bad press. So please don’t take my opinion as professional advice. I might change my mind tomorrow, because actually I can imagine why Google might raise their eyebrows over such statements.)

Outputting ads with JavaScript, preferably in iFrames

Delivering adverts with JavaScript does not mean that one can’t use server-side scripting to adjust them dynamically. With content management systems it’s not always possible to use PHP or so. In WordPress for example, PHP is executable in templates, posts and pages (requires a plugin), but not in sidebar widgets. A piece of JavaScript on the other hand works (nearly) everywhere, as long as it doesn’t come with single quotes (WordPress escapes them for storage in its MySQL database, and then fails to output them properly; that is, single quotes are converted to fancy symbols which break eval’ing the PHP code).

Let’s see how that works. Here is a banner ad created with a PHP script and delivered via JavaScript:

And here is the JS call of the PHP script:
<script type="text/javascript" src=" adName=seobook&adServed=banner"></script>

The PHP script /propaganda/output.js.php evaluates the query string to pull the requested ad’s components. In case it’s expired (e.g. promotions of conferences, affiliate program went belly up or so) it looks for an alternative (there are tons of neat ways to deliver different ads dependent on the requestor’s location and whatnot, but that’s not the point here, hence the lack of more examples). Then it checks whether the requestor is a crawler. If the user agent indicates a spider, it adds rel=nofollow to the ad’s links. Once the HTML code is ready, it outputs a JavaScript statement:
document.write('<a href=" adName=seobook&adServed=banner" title="DOWNLOAD THE BOOK ON SEO!"><img src="" width="468" height="60" border="0" alt="The only current book on SEO" title="The only current book on SEO" /></a>');
which the browser executes within the script tags (the outer JS string literal uses single quotes, so all quotes in the HTML code must be double quotes). A static ad for surfers using ancient browsers goes into the noscript tag.
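The core of such a serving script can be sketched in a few lines. buildAdJs() is a hypothetical helper name, the crawler check is the user agent test described earlier, and the ad lookup by $_GET['adName'] / $_GET['adServed'] is stubbed out:

```php
<?php
// Sketch of the core of an ad-serving script like /propaganda/output.js.php.
// In the real script, $href and $anchorHtml would come from a database
// keyed by $_GET['adName'] and $_GET['adServed'].
function buildAdJs($href, $anchorHtml, $isSpider) {
    // condomize the link only when a crawler (spoofed or real) requests it
    $rel = $isSpider ? ' rel="nofollow"' : '';
    // keep single quotes out of the markup: the outer JS literal uses them,
    // so the HTML sticks to double quotes (see the WordPress note above)
    $html = '<a href="' . htmlspecialchars($href) . '"' . $rel . '>'
          . $anchorHtml . '</a>';
    return "document.write('" . $html . "');";
}

// e.g. at the end of output.js.php:
// @header('Content-Type: application/x-javascript');
// echo buildAdJs($adUrl, $adBannerHtml, checkCrawlerUA());
```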

Matt Cutts said that JavaScript links don’t prevent Googlebot from crawling, but that those links don’t count for rankings (not long ago I read a more recent quote from Matt where he stated that this is future-proof, but I can’t find the link right now). We know that Google can interpret internal and external JavaScript code, as long as it’s fetchable by crawlers, so I wouldn’t say that delivering advertising with client sided technologies like JavaScript or Flash is a bullet-proof procedure to hide ads from Google, and the same goes for other major engines. That’s why I use rel-nofollow –on crawler requests– even in JS ads.

Change your user agent name to Googlebot or so, install Matt’s show-nofollow hack or something similar, and you’ll see that the affiliate URL gets nofollow’ed for crawlers. The dotted border in firebrick is extremely ugly, detecting condomized links this way is pretty popular, and I want to serve nice looking pages, thus I really can’t offend my readers with nofollow’ed links (although I don’t care about crawler spoofing; actually that’s a good procedure to let advertisers check out my linking attitude).

We’ll look at the affiliate URL from the code above later on; first let’s discuss other ways to make ads more search engine friendly. Search engines don’t count pages displayed in iFrames as on-page contents, especially not when the iFrame’s content is hosted on another domain. Here is an example straight from the horse’s mouth:
<iframe name="google_ads_frame" src=" very-long-and-ugly-query-string" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" frameborder="0" height="90" scrolling="no" width="728"></iframe>
In a noframes tag we could put a static ad for surfers using browsers which don’t support frames/iFrames.
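As a sketch, with an assumed ad-server URL (the real iframe src was omitted above): strictly speaking the noframes element belongs to framesets, so the more robust place for the static fallback ad is between the opening and closing iframe tags, where browsers without iframe support render it:

```html
<!-- Sketch: ad iFrame hosted on another (assumed) domain, with a static
     fallback ad between the iframe tags for non-iframe browsers -->
<iframe name="ad_frame" src="http://ads.example.com/serve.php?adName=seobook&amp;adServed=banner"
        marginwidth="0" marginheight="0" frameborder="0" scrolling="no"
        width="468" height="60">
  <a href="http://www.example.com/propaganda/router.php?adName=seobook"
     rel="nofollow">Call for action</a>
</iframe>
```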

If for some reason you don’t want to detect crawlers, or it makes sound sense to hide ads from other Web robots too, you could encode your JavaScript ads. This way you deliver totally and utterly useless gibberish to everybody, and only browsers requesting a page will render the ads. Example: any sort of text or HTML block that you would like to encrypt and hide from snoops, scrapers, parasites, or bots can be run through Michael’s Full Text/HTML Obfuscator Tool (hat tip to Donna).

Always redirect to affiliate URLs

There’s absolutely no point in using ugly affiliate URLs on your pages. Actually, that’s the last thing you want to do for various reasons.

  • For example, affiliate URLs as well as source codes can change, and you don’t want to edit tons of pages if that happens.
  • When an affiliate program doesn’t work for you, goes belly up or bans you, you need to route all clicks to another destination when the shit hits the fan. In an ideal world, you’d replace outdated ads completely with one mouse click or so.
  • Tracking ad clicks is no fun when you need to pull your stats from various sites, all of them in another time zone, using their own –often confusing– layouts, providing different views on your data, and delivering program specific interpretations of impressions or click throughs. Also, if you don’t track your outgoing traffic, some sponsors will cheat and you can’t prove your gut feelings.
  • Scrapers can steal revenue by replacing affiliate codes in URLs, but may overlook hard coded absolute URLs which don’t smell like affiliate URLs.

When you replace all affiliate URLs with the URL of a smart redirect script on one of your domains, you can really manage your affiliate links. There are many more good reasons for utilizing ad servers, for example keeping smart search engines from concluding that your advertising is overwhelming.

Affiliate links provide great footprints. Unique URL parts, respectively query string variable names, gathered by Google from all affiliate programs out there are one clear signal used to identify affiliate links. The values identify the individual affiliate marketer. Google loves to identify networks of ((thin) affiliate) sites by affiliate IDs. That does not mean that Google detects each and every affiliate link at the time of the very first fetch by Ms. Googlebot and the possibly following indexing. Processes identifying pages with (many) affiliate links and sites plastered with ads instead of unique contents can run afterwards, utilizing a well indexed database of links and linking patterns, reporting the findings to the search index, respectively delivering minus points to the query engine. Also, that doesn’t mean that affiliate URLs are the one and only trackable footprint Google relies on. But it’s one trackable footprint you can avoid to some degree.

If the redirect script’s location is on the same server (in fact it’s not, thanks to symlinks) and not named “adserver” or so, chances are that a heuristic check won’t identify the link’s intent as promotional. Of course statistical methods can discover your affiliate links by analyzing patterns, but those might be similar to patterns which have nothing to do with advertising, for example click tracking of editorial votes, links to contact pages which aren’t crawlable with parameters, or similar “legit” stuff. However, you can’t fool smart algos forever, but if you’ve a good reason to hide ads, every little bit might help. Of course, providing lots of great contents countervails lots of ads (from a search engine’s point of view, and users might agree on this).

Besides all these (pseudo) black hat thoughts and reasoning, there is a way more important advantage of redirecting links to sponsors: blocking crawlers. Yup, search engine crawlers must not follow affiliate URLs, because that doesn’t benefit you (usually). Actually, every affiliate link is a useless PageRank leak. Why should you boost the merchant’s search engine rankings? Better take care of your own rankings by hiding such outgoing links from crawlers, and stopping crawlers before they spot the redirect if they found an affiliate link without a link condom by accident.

The behavior of an adserver URL masking an affiliate link

Let’s look at the redirect script’s URL from my code example above. On request of router.php, the $adName variable identifies the affiliate link, $adServed tells which sort/type/variation of ad was clicked, and all that gets stored with a timestamp under the title and URL of the page carrying the advert.
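The logging part can be as primitive as one tab-separated line per click. This is a sketch; the file location and field order are my assumptions:

```php
<?php
// Sketch of router.php's click logging: one tab-separated line per click
// with timestamp, ad name, ad variation, and the referring page (which
// carries the advert). File path and field order are assumptions.
function logAdClick($adName, $adServed, $logFile = '/tmp/ad-clicks.log') {
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-';
    $line = implode("\t", array(date('Y-m-d H:i:s'), $adName, $adServed, $referer)) . "\n";
    // LOCK_EX so concurrent clicks don't interleave their lines
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```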

Now that we’ve covered the statistical requirements: router.php calls the checkCrawlerIP() function, setting $isSpider to TRUE only when both the user agent as well as the host name of the requestor’s IP address identify a search engine crawler, and a forward DNS lookup of that host name returns the requestor’s IP addy.

If the requestor is not a verified crawler, router.php does a 307 redirect to the sponsor’s landing page:
$sponsorUrl = "";
$requestProtocol = $_SERVER["SERVER_PROTOCOL"];
$protocolArr = explode("/", $requestProtocol);
$protocolName = trim($protocolArr[0]);
$protocolVersion = trim($protocolArr[1]);
if (stristr($protocolName, "HTTP")
 && strtolower($protocolVersion) > "1.0") {
    $httpStatusCode = 307;
}
else {
    $httpStatusCode = 302;
}
$httpStatusLine = "$requestProtocol $httpStatusCode Temporary Redirect";
@header($httpStatusLine, TRUE, $httpStatusCode);
@header("Location: $sponsorUrl");

A 307 redirect avoids caching issues, because 307 redirects must not be cached by the user agent. That means that changes of sponsor URLs take effect immediately, even when the user agent has cached the destination page from a previous redirect. If the request came in via HTTP/1.0, we must perform a 302 redirect, because the 307 response code was introduced with HTTP/1.1 and some older user agents might not be able to handle 307 redirects properly. User agents can cache the locations provided by 302 redirects, so possibly when they run into a page known to redirect, they might request the outdated location. For obvious reasons we can’t use the 301 response code, because 301 redirects are always cachable. (More information on HTTP redirects.)

If the requestor is a major search engine’s crawler, we perform the most brutal bounce back known to man:
if ($isSpider) {
    @header("HTTP/1.1 403 Sorry Crawlers Not Allowed", TRUE, 403);
    @header("X-Robots-Tag: nofollow,noindex,noarchive");
}

The 403 response code translates to “kiss my ass and get the fuck outta here”. The X-Robots-Tag in the HTTP header instructs crawlers that the requested URL must not be indexed, doesn’t provide links the poor beast could follow, and must not be publicly cached by search engines. In other words the HTTP header tells the search engine “forget this URL, don’t request it again”. Of course we could use the 410 response code instead, which tells the requestor that a resource is irrevocably dead, gone, vanished, non-existent, and further requests are forbidden. Both the 403-Forbidden response as well as the 410-Gone return code prevent URL-only listings on the SERPs (once the URL was crawled). Personally, I prefer the 403 response because it perfectly and unmistakably expresses my opinion on this sort of search engine guidelines, although currently nobody except Google understands or supports X-Robots-Tags in HTTP headers.

If you don’t use URLs provided by affiliate programs, your affiliate links can never influence search engine rankings, hence the engines are happy because you did their job so obediently. Not that they would otherwise count (most of) your affiliate links for rankings, but forcing you to castrate your links yourself makes their life much easier, and you don’t need to live in fear of penalties.

[Image: NICE: prospering affiliate link]

Before you output a page carrying ads, paid links, or other selfish links with commercial intent, check if the requestor is a search engine crawler, and act accordingly.

Don’t deliver different (editorial) contents to users and crawlers, but also don’t serve ads to crawlers. They just won’t buy your eBook or whatever you sell, unless a search engine sends out Web robots with credit cards, able to understand Ajax, respectively authorized to fill out and submit Web forms.

Your ads look plain ugly with dotted borders in firebrick, hence don’t apply rel=”nofollow” to links when the requestor is not a search engine crawler. The engines are happy with machine-readable disclosures, and you can discuss everything else with the FTC yourself.

No nay never use links or content provided by affiliate programs on your pages. Encapsulate this kind of content delivery in AdServers.

Do not allow search engine crawlers to follow your affiliate links, paid links, or other disliked votes as per search engine guidelines. Of course condomizing such links is not your responsibility, but getting penalized for not doing Google’s job is not exactly funny.

I admit that some of the stuff above is for extremely paranoid folks only, but knowing how to be paranoid might prevent you from making silly mistakes. Just because you believe that you’re not paranoid does not mean Google won’t chase you down. You really don’t need to be a so-called black hat to displease Google. Not knowing, respectively not understanding, Google’s 12 commandments doesn’t protect you from being spanked for sins you’ve never heard of. If you’re keen on Google’s nicely targeted traffic, better play by Google’s rules, leastwise on crawler requests.

Feel free to contribute your tips and tricks in the comments.


Text link broker woes: Google’s smart paid link sniffers

[Image: Google’s smart paid link sniffer at work]

After the recent toolbar PageRank massacre, link brokers are in the spotlight. One of them, TNX (beta), asked me to post a paid review of their service. It took a while to explain that nobody can buy a sales pitch here. I offered to write a pitiless honest review for a low hourly fee, provided a sample on their request, but got no order or payment yet. Never mind. Since the topic is hot, here’s my review, paid or not.

So what does TNX offer? Basically it’s a semi-automated link exchange where everybody can sign up to sell and/or purchase text links. TNX takes 25% commission, 12.5% from the publisher, and 12.5% from the advertiser. They calculate the prices based on Google’s toolbar PageRank and link popularity pulled from Yahoo. For example a site putting five blocks of four links each on one page with toolbar PageRank 4/10 and four pages with a toolbar PR 3/10 will earn $46.80 monthly.

TNX provides a tool to vary the links, so that when an advertiser purchases for example 100 links, it’s possible to output those with 100 variations of anchor text as well as of the surrounding text before and after the A element, on possibly 100 different sites. Also TNX has a solution to increase the number of links slowly, so that search engines won’t find a gazillion of uniform links to a (new) site all of a sudden. Whether or not that’s sufficient to simulate natural link growth remains an unanswered question, because I’ve no access to their algorithm.

Links as well as participating sites are reviewed by TNX staff and frequently checked with bots. Links shouldn’t appear on pages which aren’t indexed by search engines or viewed by humans, on 404 pages, or on pages with long and ugly URLs and such. They don’t accept PPC links or offensive ads.

All links are output server-side, which requires PHP or Perl (ASP/ASPX coming soon). There is a cache option, so it’s not necessary to download the links from the TNX servers for each page view. TNX recommends renaming the /cache/ directory to avoid an easily detectable sign of the presence of TNX paid links on a Web site. Links are stored as plain HTML; besides the target="_blank" attribute there is no obvious footprint or pattern on link level. Example:
Have a website? See this <a href="" target="_blank">free affiliate program</a>.
Have a blog? Check this <a href="" target="_blank">affiliate program with high commissions</a> for publishers.

Webmasters can enter any string as delimiter, for example <br /> or “•”:

Have a website? See this free affiliate program. • Have a blog? Check this affiliate program with high commissions for publishers.

Publishers can choose from 17 niches, 7 languages, 5 linkpop levels, and 7 toolbar PageRank values to target their ads.

From the system stats in the members’ area, the service is widely used:

  • As of today [2007-11-06] we have 31,802 users (daily growth: +0.62%)
  • Links in the system: 31,431,380
  • Links created in last hour: 1,616
  • Number of pages indexed by TNX: 37,221,398

Long story short, TNX jumped through many hoops to develop a system which is supposed to trade paid links that are undetectable by search engines. Is that so?

The major weak point is the system’s growth, and that its users are humans. Even if such a system were perfect, users will make mistakes and reveal the whole network to search engines. Here is how Google has identified most if not all of the TNX paid links:

Some Webmasters put their TNX links in sidebars under a label that identifies them as paid links. Google crawled those pages and stored the link destinations in its paid links database. Also, they devalued at least the labelled links, if not the whole page; or even the complete site lost its ability to pass link juice because the few paid links aren’t condomized.

Many Webmasters implemented their TNX links in templates, so that they appear on a large number of pages. Actually, that’s recommended by TNX. Even if the advertisers have used the text variation tool, their URLs appeared multiple times on each site. Google can detect site wide links, even if not each and every link appears on all pages, and flags them accordingly.

Maybe even a few Googlers have signed up and served TNX links on their personal sites to gather examples, although that wasn’t necessary because so many Webmasters with URLs in their signatures have told Google in this DP thread that they’ve signed up and at least tested TNX links on their pages.

Next, Google compared the anchor text as well as the surrounding text of all flagged links, and found some patterns. Of course putting text before and after the linked anchor text seems a smart way to fake a natural link, but in fact Webmasters applied a bullet-proof procedure to outsmart themselves, because with multiple occurrences of the same text constellations pointing to a URL, especially when found on unrelated sites (different owners, hosts etc.; topical irrelevancy plays no role in this context), paid link detection is a breeze. Linkage like that may be “natural” with regard to patterns like site wide advertising or navigation, but a lookup in Google’s links database revealed that the same text constellations and URLs were found on n other sites too.
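Nobody outside Google knows the real algo, but the statistical idea is easy to sketch as a toy: collect each link’s text constellation (text before, anchor text, text after, destination URL) and flag destinations whose identical constellation shows up on several unrelated sites. Function name, input shape and threshold below are all my assumptions:

```php
<?php
// Toy sketch of the statistical idea described above: an identical
// surrounding-text constellation pointing at the same URL from many
// unrelated sites is a strong paid-link signal.
// $links: arrays with keys site, before, anchor, after, url.
function flagPaidLinkConstellations(array $links, $minSites = 3) {
    $bySignature = array();
    foreach ($links as $l) {
        $sig = md5($l['before'] . '|' . $l['anchor'] . '|' . $l['after'] . '|' . $l['url']);
        $bySignature[$sig][$l['site']] = $l['url']; // distinct sites per constellation
    }
    $flagged = array();
    foreach ($bySignature as $sites) {
        if (count($sites) >= $minSites) { // same constellation on n unrelated sites
            $flagged[] = reset($sites);   // flag the advertised URL
        }
    }
    return array_unique($flagged);
}
```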

Now that Google had compiled the seed, each and every instance of Googlebot delivered more evidence. It took Google only one crawl cycle to identify most sites carrying TNX links, and all TNX advertisers. Paid link flags from pages on sites with a low crawling frequency were delivered in addition. Meanwhile Google has drawn a comprehensive picture of the whole TNX network.

I developed such a link network many years ago (it’s defunct now). It was successful because only very experienced Webmasters controlling a fair amount of squeaky clean sites were invited. Allowing newbies to participate in such an organized link swindle is the kiss of death, because newbies do make newbie mistakes, and Google makes use of newbie mistakes to catch all participants. By the way, with the capabilities Google has today, my former approach to manipulating rankings with artificial linkage would be detectable with statistical methods similar to the algo outlined above, despite the closed circle of savvy participants.

From reading the various DP threads about TNX as well as their sales pitches, I’ve recognized a very popular misunderstanding of Google’s mentality. Folks are worrying whether an algo can detect the intention of links or not, usually focusing on particular links or linking methods. Google on the other hand looks at the whole crawlable Web. When they develop a paid link detection algo, they have a copy of the known universe to play with, as well as a complete history of each and every hyperlink crawled by Ms. Googlebot since 1998 or so. Naturally, their statistical methods will catch massive artificial linkage first, but fine tuning the sensitivity of paid link sniffers respectively creating variants to cover different linking patterns is no big deal. Of course there is always a way to hide a paid link, but nobody can hide millions of them.

Unfortunately, the unique selling point of the TNX service –that goes for all link brokers, by the way– is manipulation of search engine rankings, hence even if they offered nofollow’ed links to trade traffic instead of PageRank, they would most probably be forced to reduce their prices. Since TNX links are rather cheap, I’m not sure that would pay. It would be a shame if they changed the business model and it didn’t pay for TNX, because the underlying concept is great. It just shouldn’t be used to exchange clean links. All the tricks developed to outsmart Google, like the text variation tool or avoiding placement on barely trafficked pages, are suitable to serve non-repetitive ads (coming with attractive CTRs) to humans.

I’ve asked TNX: I’ve decided to review your service on my blog, regardless of whether you pay me or not. The result of my research is that I can’t recommend TNX in its current shape. If you still want a paid review, and/or a quote in the article, I’ve a question: provided Google has drawn a detailed picture of your complete network, are you ready to switch to nofollow’ed links in order to trade traffic instead of PageRank, possibly with slightly reduced prices? Their answer:

We would be glad to accept your offer of a free review, because we don’t want to pay for a negative review.
Nobody can draw a detailed picture of our network - it’s impossible for one advertiser to buy links from all or a majority sites of our network. Many webmasters choose only relevant advertisers.
We will not switch to nofollow’ed links, but we are planning not to use Google PR for link pricing in the near future - we plan to use our own real-time page-value rank.

Well, it’s not necessary to find one or more links on all sites to identify a network.


The day the routers died

Why the fuck do we dumb and clueless Internet marketers care about Google’s Toolbar PageRank when the Internet faces real issues? Well, both the toolbar slider as well as IPv4 are somewhat finite.

I can hear the IM crowd singing “The day green pixels died” … whilst Matt’s gang in building 43 intones “No mercy, smack paid links, no place to hide for TLA links” … Enjoy this video, it’s friggin’ hilarious:


Since Gary Feldman’s song “The Day The Routers Died” will become an evergreen soon, I thought you might be interested in a transcript:

A long long time ago
I can still remember
When my laptop could connect elsewhere.

And I tell you all there was a day
The network card I threw away
Had a purpose and it worked for you and me.

But 18 years completely wasted
With each address we’ve aggregated
The tables overflowing
The traffic just stopped flowing.

And now we’re bearing all the scars
And all my traceroutes showing stars
The packets would travel faster in cars
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 makes me let out a sigh
But I spose we’d better give it a try
I suppose we’d better give it a try!

Now did you write an RFC
That dictated how we all should be
Did we listen like we should that day?

Now were you back at RIPE fifty-four
Where we heard the same things months before
And the people knew they’d have to change their ways.

And we knew that all the ISPs
Could be future proof for centuries.

But that was then not now
Spent too much time playing WoW.

Ooh there was time we sat on IRC
Making jokes on how this day would be
Now there’s no more use for TCP
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 just makes me let out a sigh
But I spose we’d better give it a try
I suppose we’d better give it a try!

I remember those old days I mourn
Sitting in my room, downloading porn
Yeah that’s how it used to be.

When the packets flowed from A to B
Via routers that could talk IP
There was data [that] could be exchanged between you and me.

Oh but I could see you all ignore
The fact we’d fill up IPv4!

But we all lost the nerve
And we got what we deserved!

And while we threw our network kit away
And wished we’d heard the things they say
Put all our lives in disarray
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 just makes me let out a sigh
But I spose we’d better give it a try
I suppose we’d better give it a try!

Saw a man with whom I used to peer
Asked him to rescue my career
He just sighed and turned away.

I went down to the ‘net cafe
That I used to visit everyday
But the man there said I might as well just leave.

[And] now we’ve all lost our purpose
My cisco shares completely worthless
No future meetings for me
At the Hotel Krasnapolsky.

And the men that make us push and push
Like Geoff Huston and Randy Bush
Should’ve listened to what they told us
The day the routers died.

So bye bye, folks at RIPE:55
Be persuaded to upgrade it or your network will die
IPv6 just makes me let out a sigh
But I spose we’d better give it a try
[I suppose we’d better give it a try!]

Recorded at the RIPE:55 meeting in Amsterdam (NL) at the Krasnapolsky Hotel between 22 and 26 October 2007.

Just in case the video doesn’t load, here is another recording.


A pragmatic defence against Google’s anti paid links campaign

Google’s recent shot across the bows of a gazillion sites handling paid links, advertising, or internal cross links not compliant to Google’s imagination of a natural link is a call for action. Google’s message is clear: “condomize your commercial links or suffer” (from deducted toolbar PageRank, links without the ability to pass real PageRank and relevancy signals, or perhaps even penalties).

Of course that’s somewhat evil, because applying nofollow values to all sorts of links is not exactly a natural thing to do; visitors don’t care about invisible link attributes, and sometimes they’re even pissed when they get redirected to an URL not displayed in their status bar. Also, this requirement forces Webmasters to invest enormous efforts in code maintenance for the sole purpose of satisfying search engines. The argument “if Google doesn’t like these links, then they can discount them in their system, without bothering us” has its merits, but unfortunately that’s not the way Google’s cookie crumbles, for various reasons. Hence let’s develop a pragmatic procedure to handle those links.

The problem

Google thinks that uncondomized paid links as well as commercial links to sponsors or affiliated entities aren’t natural, because the terms “sponsor|pay for review|advertising|my other site|sign-up|…” and “editorial vote” are not compatible in the sense of Google’s guidelines. This view of the Web’s linkage is pretty black vs. white.

Either you link out because a sponsor bought ads, or you don’t sell ads and link out for free because you honestly think your visitors will like a page. Links to sponsors without condom are black, links to sites you like and which you don’t label “sponsor” are white.

There’s nothing in between; gray areas like links to hand picked sponsors on a page with a gazillion links count as black. Google doesn’t care whether or not your clean links actually pass a reasonable amount of PageRank to link destinations which buy ad space too; the sole possibility that those links could influence search results is enough to qualify you as sort of a link seller.

The same goes for paid reviews on blogs and whatnot, see for example Andy’s problem with his honest reviews which Google classifies as paid links, and of course all sorts of traffic deals, affiliate links, banner ads and stuff like that.

You don’t even need to label a clean link as advert or sponsored. If the link destination matches a domain in Google’s database of on-line advertisers, link buyers, e-commerce sites / merchants etcetera, or Google figures out that you link too much to affiliated sites or other sites you own or control, then your toolbar PageRank is toast, and most probably your outgoing links will be penalized. Possibly these penalties have an impact on your internal links too, which results in less PageRank landing on subsidiary pages. Less PageRank gathered by your landing pages means less crawling, less ranking, less SERP referrers, less revenue.

The solution

You’re absolutely right when you say that such search engine nitpicking should not force you to throw nofollow crap on your links like confetti. From your and my point of view condomizing links is wrong, but sometimes it’s better to pragmatically comply to such policies in order to stay in the game.

Although uncrawlable redirect scripts have advantages in some cases, the simplest procedure to condomize a link is the rel-nofollow microformat. Here is an example of a googlified affiliate link:
<a href="" rel="nofollow">Sponsor</a>

Why serve your visitors search engine crawler directives?

Complying to Google’s laws does not mean that you must deliver crawler directives like rel=”nofollow” to your visitors. Since Google is concerned about search engine rankings influenced by uncondomized links with commercial intent, serving crawler directives to crawlers and clean links to users is perfectly in line with Google’s goals. Actually, initiatives like the X-Robots-Tag make clear that hiding crawler directives from users is fine with Google. To underline that, here is a quote from Matt Cutts:

[…] If you want to sell a link, you should at least provide machine-readable disclosure for paid links by making your link in a way that doesn’t affect search engines. […]

The other best practice I’d advise is to provide human readable disclosure that a link/review/article is paid. You could put a badge on your site to disclose that some links, posts, or reviews are paid, but including the disclosure on a per-post level would be better. Even something as simple as “This is a paid review” fulfills the human-readable aspect of disclosing a paid article. […]

Google’s quality guidelines are more concerned with the machine-readable aspect of disclosing paid links/posts […]

To make sure that you’re in good shape, go with both human-readable disclosure and machine-readable disclosure, using any of the methods [uncrawlable redirects, rel-nofollow] I mentioned above.
[emphasis mine]

Since Google devalues paid links anyway, search engine friendly cloaking of rel-nofollow for Googlebot is a non-issue with advertisers, as long as this fact is disclosed. I bet most link buyers look at the magic green pixels anyway, but that’s their problem.

How to cloak rel-nofollow for search engine crawlers

I’ll discuss a PHP/Apache example, but this method is adaptable to other server sided scripting languages like ASP or so with ease. If you’ve a static site and PHP is available on your (*ix) host, you need to tell Apache that you’re using PHP in .html (.htm) files. Put this statement in your root’s .htaccess file:
AddType application/x-httpd-php .html .htm

Next create a plain text file, insert the code below, and upload it as “funct_nofollow.php” or so to your server’s root directory (or a subdirectory, but then you need to change some code below).
function makeRelAttribute ($linkClass) {
  $relValue = "";
  // optional 2nd input parameter: $relValue
  if (func_num_args() >= 2) {
    $relValue = func_get_arg(1) ." ";
  }
  $referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
  $refUrl = parse_url($referrer);
  $refHost = (is_array($refUrl) && isset($refUrl["host"])) ? $refUrl["host"] : "";
  $isSerpReferrer = FALSE;
  if (stristr($refHost, "google.") ||
      stristr($refHost, "yahoo."))
    $isSerpReferrer = TRUE;
  $userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
  $isCrawler = FALSE;
  if (stristr($userAgent, "Googlebot") ||
      stristr($userAgent, "Slurp"))
    $isCrawler = TRUE;
  if ($isCrawler /*|| $isSerpReferrer*/ ) {
    if ("$linkClass" == "ad") $relValue .= "advertising nofollow";
    if ("$linkClass" == "paid") $relValue .= "sponsored nofollow";
    if ("$linkClass" == "own") $relValue .= "affiliated nofollow";
    if ("$linkClass" == "vote") $relValue .= "editorial dofollow";
  }
  if (empty($relValue))
    return "";
  return " rel=\"" .trim($relValue) ."\" ";
} // end function makeRelAttribute

Next put the code below in a PHP file you’ve included in all scripts, for example header.php. If you’ve static pages, then insert the code at the very top.
@include($_SERVER["DOCUMENT_ROOT"] ."/funct_nofollow.php");

Do not paste the function makeRelAttribute itself! If you spread code this way, you’ve to edit tons of files when you need to change the functionality later on.

Now you can use the function makeRelAttribute($linkClass, $relValue) within the scripts or HTML pages. The function has an input parameter $linkClass and knows the (self-explanatory) values “ad”, “paid”, “own” and “vote”. The second (optional) input parameter is a value for the A element’s REL attribute itself. If you provide it, it gets appended, or, if makeRelAttribute doesn’t detect a spider, it creates a REL attribute with this value. Examples below. You can add more user agents, or serve rel-nofollow to visitors coming from SERPs by enabling the || $isSerpReferrer condition (remove the /* and */ around it).

When you code a hyperlink, just add the function to the A tag. Here is a PHP example:
print "<a href=\"\"" .makeRelAttribute("ad") .">Google</a>";

will output
<a href="" rel="advertising nofollow" >Google</a>
when the user agent is Googlebot, and
<a href="">Google</a>
to a browser.

If you can’t write nice PHP code, for example because you’ve to follow crappy guidelines and worst practices with a WordPress blog, then you can mix HTML and PHP tags:
<a href=""<?php print makeRelAttribute("paid"); ?>>Yahoo</a>

Please note that this method is not safe with search engines or unfriendly competitors when you want to cloak for other purposes. Also, the link condoms are served to crawlers only, that means search engine staff reviewing your site with a non-crawler user agent name won’t spot the nofollow’ed links unless they check the engine’s cached page copy. An HTML comment in HEAD like “This site serves machine-readable disclosures, e.g. crawler directives like rel-nofollow applied to links with commercial intent, to Web robots only.” as well as a similar comment line in robots.txt would certainly help to pass reviews by humans.

A Google-friendly way to handle paid links, affiliate links, and cross linking

Load this page with different user agents and referrers. You can do this for example with a FireFox extension like PrefBar. For testing purposes you can use these user agent names:
Mozilla/5.0 (compatible; Googlebot/2.1; +
Mozilla/5.0 (compatible; Yahoo! Slurp;

and these SERP referrer URLs:

Just enter these values in PrefBar’s user agent respectively referrer spoofing options (click “Customize” on the toolbar, select “User Agent” / “Referrerspoof”, click “Edit”, add a new item, label it, then insert the strings above). Here is the code above in action:

Ad makeRelAttribute(”ad”): Google
Paid makeRelAttribute(”paid”): Yahoo
Own makeRelAttribute(”own”): Sebastian’s Pamphlets
Vote makeRelAttribute(”vote”): The Link Condom
External makeRelAttribute(”", “external”): W3C rel="external"
Without parameters makeRelAttribute(”"): Sphinn

When you change your browser’s user agent to a crawler name, or fake a SERP referrer, the REL value will appear in the right column.

When you’ve developed a better solution, or when you’ve a nofollow-cloaking tutorial for other programming languages or platforms, please let me know in the comments. Thanks in advance!


Google Toolbar PageRank deductions make sense

Since toolbar PR had been stale since April, and now only a few sites were “updated” without any traffic losses, I can imagine that’s just a “watch out” signal from Google, not yet a penalty. Of course it’s not a conventional toolbar PageRank update, because new pages aren’t affected. That means the deductions are not caused by a finite amount of PageRank spread over more pages discovered by Google since the last toolbar PR update.

Unfortunately, in the current toolbar PR hysteria next to nobody tries to figure out Google’s message. Crying foul is not very helpful, since Google is not exactly known as a company revising such decisions based on Webmaster rants lashing “unfair penalties”.

By the way, I think Andy is spot on. Paid links are definitively a cause of toolbar PageRank downgrades. Artificial links of any kind are another issue. Google obviously has a different take on interlinking respectively crosslinking, for example. Site owners argue that it makes business sense, but Google might think most of these links come without value for their users. And there are tons more pretty common instances of “link monkey business”.

Maybe Google alerts all sorts of sites violating the SEO bible’s twelve commandments with a few less green pixels, before they roll out new filters which would catch those sins and penalize the offending pages accordingly. Actually, this would make a lot of sense.

All site owners and Webmasters monitor their toolbar PR. Any significant changes are discussed in a huge community. If the crowd assumes that artificial links cause toolbar PR deductions, many sites will change their linkage. This happened already after the first shot across the bows two weeks ago. And it will work again. Google gets the desired results: less disliked linkage, less sites selling uncondomized links.

That’s quite smart. Google has learned that they can’t ban or overpenalize popular sites, because that leads to fucked up search results for not only navigational search queries, in other words pissed searchers. Taking back a few green pixels from the toolbar on the other hand is not an effective penalty, because toolbar PR is unrelated to everything that matters. It is however a message with guaranteed delivery.

Running algos in development stage on the whole index and using their findings to manipulate toolbar PageRank data hurts nobody, but might force many Webmasters to change their stuff in order to comply to Google’s laws. As a side effect, this procedure even helps to avoid too much collateral damage when the actual filters become active later on.

There seems to exist another pattern. Most sites targeted by the recent toolbar PageRank deductions are SEO aware to some degree. They will spread the word. And complain loudly. Google has quite a few folks on the payroll who monitor the blogosphere, SEO forums, Webmaster hangouts and whatnot. Analyzing such reactions is a great way to gather input usable to validate and fine tune not yet launched algos.

Of course that’s sheer speculation. What do you think, does Google use toolbar PR as a “change your stuff or find yourself kicked out soon” message? Or is it just a try to make link selling less attractive?

Update: Insightful posts on Google’s toolbar PageRank manipulations:

And here is a pragmatic answer to Google’s paid links requirements: Cloak the hell out of your links with commercial intent!


The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise

We find redirects on every Web site out there. They’re often performed unnoticed in the background, unintentionally messed up, implemented with a great deal of ignorance, and seldom perfect from a SEO perspective. Unfortunately, the Webmaster boards are flooded with contradictory, misleading and plain false advice on redirects. If you for example read “for SEO purposes you must make use of 301 redirects only”, then better close the browser window/tab to protect yourself from crappy advice. A 302 or 307 redirect can be search engine friendly too.

With this post I do plan to bore you to death. So lean back, grab some popcorn, and stay tuned for a longish piece explaining the Interweb’s forwarding requests as dull as dust. Or, if you know everything about redirects, then please digg, sphinn and stumble this post before you surf away. Thanks.

Redirects are defined in the HTTP protocol, not in search engine guidelines

For the moment please forget everything you’ve heard about redirects and their SEO implications, clear your mind, and follow me to the very basics defined in the HTTP protocol. Of course search engines interpret some redirects in a non-standard way, but understanding the norm as well as its use and abuse is necessary to deal with server sided redirects. I don’t bother with outdated HTTP 1.0 stuff, although some search engines still apply it every once in a while, hence I’ll discuss the 307 redirect introduced in HTTP 1.1 too. For information on client sided redirects please refer to Meta Refresh - the poor man’s 301 redirect or read my other pamphlets on redirects, and stay away from JavaScript URL manipulations.

What is a server sided redirect?

Think about an HTTP redirect as a forwarding request. Although redirects work slightly differently from snail mail forwarding requests, this analogy fits the procedure perfectly. Whilst with US Mail forwarding requests a clerk or postman writes the new address on the envelope before it bounces in front of a no longer valid (or temporarily abandoned) letter-box or pigeon hole, on the Web the request’s location (that is, the Web server responding to the server name part of the URL) provides the requestor with the new location (an absolute URL).

A server sided redirect tells the user agent (browser, Web robot, …) that it has to perform another request for the URL given in the HTTP header’s “location” line in order to fetch the requested contents. The type of the redirect (301, 302 or 307) also instructs the user agent how to perform future requests of the Web resource. Because search engine crawlers/indexers try to emulate human traffic with their content requests, it’s important to choose the right redirect type both for humans and robots. That does not mean that a 301-redirect is always the best choice, and it certainly does not mean that you always must return the same HTTP response code to crawlers and browsers. More on that later.

Execution of server sided redirects

Server sided redirects are executed before your server delivers any content. In other words, your server ignores everything it could deliver (be it a static HTML file, a script output, an image or whatever) when it runs into a redirect condition. Some redirects are done by the server itself (see handling incomplete URIs), and there are several places where you can set (conditional) redirect directives: Apache’s httpd.conf, .htaccess, or in application layers for example in PHP scripts. (If you suffer from IIS/ASP maladies, this post is for you.) Examples:

Browser request
→ Apache (httpd.conf) may answer with a 301 header
→ .htaccess may answer with a 301 header
→ /page.php may answer with a 301 header
→ otherwise: 200 header (info like content length …) followed by the content, e.g. “Article #2”

The 301 header may or may not be followed by a hyperlink pointing to the new location, solely added for user agents which can’t handle redirects. Besides that link, there’s no content sent to the client after the redirect header.

More important, you must not send a single byte to the client before the HTTP header. If you for example code [space(s)|tab|new-line|HTML code]<?php ... in a script that shall perform a redirect or is supposed to return a 404 header (or any HTTP header different from the server’s default instructions), you’ll produce a runtime error. The redirection fails, leaving the visitor with an ugly page full of cryptic error messages but no link to the new location.

That means in each and every page or script which possibly has to deal with the HTTP header, put the logic testing those conditions at the very top. Always send the header status code and optional further information like a new location to the client before you process the contents.

After the last redirect header line terminate execution with the “L” parameter in .htaccess, PHP’s exit; statement, or whatever.
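In PHP the whole dance boils down to a status line, a location line, and an exit. The helper below just builds the two header lines so the sketch stays testable; the function name and the target URL are placeholders of mine, not from this post.

```php
<?php
// Sketch: build the two header lines of a permanent redirect.
function permanentRedirectHeaders($newLocation) {
    return array(
        "HTTP/1.1 301 Moved Permanently", // explicit status line first
        "Location: " . $newLocation,      // absolute URL only
    );
}
// In a live script -- at the very top, before a single byte of output:
//   foreach (permanentRedirectHeaders("http://www.example.com/new") as $h) header($h);
//   exit; // terminate: nothing below a redirect header should run
```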

What is an HTTP redirect header?

An HTTP redirect, regardless of its type, consists of two lines in the HTTP header. In this example I’ve requested a URL whose server name lacks the www-thingy (an invalid URI on this site), hence my canonicalization routine outputs this HTTP header:
HTTP/1.1 301 Moved Permanently
Date: Mon, 01 Oct 2007 17:45:55 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4

Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

The redirect response code in a HTTP status line

The first line of the header defines the protocol version, the response code, and provides a human readable reason phrase. Here is a shortened and slightly modified excerpt quoted from the HTTP/1.1 protocol definition:


The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP (space) characters. No CR or LF is allowed except in the final CRLF sequence.

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
[e.g. “HTTP/1.1 301 Moved Permanently” + CRLF]

Status Code and Reason Phrase

The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. […] The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.

The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. […]:
- 3xx: Redirection - Further action must be taken in order to complete the request

The individual values of the numeric status codes defined for HTTP/1.1, and an example set of corresponding Reason-Phrases, are presented below. The reason phrases listed here are only recommendations — they MAY be replaced by local equivalents without affecting the protocol [that means you could translate and/or rephrase them].
300: Multiple Choices
301: Moved Permanently
302: Found [Elsewhere]
303: See Other
304: Not Modified
305: Use Proxy

307: Temporary Redirect

In terms of SEO the understanding of 301/302-redirects is important. 307-redirects, introduced with HTTP/1.1, are still capable of confusing some search engines, even major players like Google, when Ms. Googlebot for some reason thinks she must do HTTP/1.0 requests, usually caused by weird respectively ancient server configurations (or possibly testing newly discovered sites under certain circumstances). You should not perform 307 redirects as response to most HTTP/1.0 requests; use 302/301 –whatever fits best– instead. More info on this issue below in the 302/307 sections.

Please note that the default response code of all redirects is 302. That means when you send an HTTP header with a location directive but without an explicit response code, your server will return a 302-Found status line. That’s kinda crappy, because in most cases you want to avoid the 302 code like the plague. Do no nay never rely on default response codes! Always prepare a server sided redirect with a status line telling an actual response code (301, 302 or 307)! In server sided scripts (PHP, Perl, ColdFusion, JSP/Java, ASP/VB-Script…) always send a complete status line, and in .htaccess or httpd.conf add a [R=301|302|307,L] parameter to statements like RewriteRule:
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

The redirect header’s “location” field

The next element you need in every redirect header is the location directive. Here is the official syntax:


The Location response-header field is used to redirect the recipient to a location other than the Request-URI for completion of the request or identification of a new resource. […] For 3xx responses, the location SHOULD indicate the server’s preferred URI for automatic redirection to the resource. The field value consists of a single absolute URI.

Location = “Location” “:” absoluteURI [+ CRLF]

An example is:


Please note that the value of the location field must be an absolute URL, that is a fully qualified URL with scheme (http|https), server name (domain|subdomain), and path (directory/file name) plus the optional query string (”?” followed by variable/value pairs like ?id=1&page=2...), no longer than 2047 bytes (better 255 bytes, because most scripts out there don’t process longer URLs for historical reasons). A relative URL like ../page.php might work in (X)HTML (although you’d better plan a spectacular suicide than any use of relative URIs!), but you must not use relative URLs in HTTP response headers!
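A quick sanity check along these lines keeps relative URLs out of the location field. This is just a sketch of mine; the function name and the http/https whitelist are assumptions.

```php
<?php
// Sketch: is a candidate "Location" value a fully qualified URL?
// parse_url() must yield at least a scheme and a host.
function isAbsoluteUrl($url) {
    $parts = @parse_url($url);
    return is_array($parts)
        && isset($parts["scheme"]) && isset($parts["host"])
        && ($parts["scheme"] == "http" || $parts["scheme"] == "https")
        && strlen($url) <= 2047; // stay within the length limit mentioned above
}
```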

How to implement a server sided redirect?

You can perform HTTP redirects with statements in your Web server’s configuration, and in server sided scripts, e.g. PHP or Perl. JavaScript is a client sided language and therefore lacks a mechanism to do HTTP redirects. That means all JS redirects count as a 302-Found response.

Bear in mind that when you redirect, you possibly leave tracks of outdated structures in your HTML code, not to speak of incoming links. You must change each and every internal link to the new location, as well as all external links you control or where you can ask for an URL update. If you leave any outdated links, visitors probably don’t spot it (although every redirect slows things down), but search engine spiders continue to follow them, which eventually ends in redirect chains. Chained redirects often are the cause of search engines deindexing pages, site areas or even complete sites, hence do no more than one redirect in a row and consider two redirects in a row risky. You don’t control offsite redirects; in some cases a search engine has already counted one or two redirects before it requests your redirecting URL (caused by redirecting traffic counters etcetera). Always redirect to the final destination to avoid useless hops which kill your search engine traffic. (Google recommends “that you use fewer than five redirects for each request”, but don’t try to max out such limits because other services might be less BS-tolerant.)
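To audit your own linkage for chains, you can model your redirects as a source-to-target map and count the hops a crawler would follow. A minimal sketch (function name and the five-hop cap, which mirrors Google’s stated limit, are my choices):

```php
<?php
// Sketch: count redirect hops from a start URL, given a map of
// redirecting URLs (source => target). More than one hop is a smell;
// chains are what gets URLs deindexed.
function countRedirectHops(array $redirectMap, $url, $maxHops = 5) {
    $hops = 0;
    // the cap also guards against redirect loops in the map
    while (isset($redirectMap[$url]) && $hops < $maxHops) {
        $url = $redirectMap[$url];
        $hops++;
    }
    return $hops;
}
```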

Like conventional forwarding requests, redirects do expire. Even a permanent 301-redirect’s source URL will be requested by search engines every now and then because they can’t trust you. As long as there is one single link pointing to an outdated and redirecting URL out there, it’s not forgotten. It will stay alive in search engine indexes and address books of crawling engines even when the last link pointing to it was changed or removed. You can’t control that, and you can’t find all inbound links a search engine knows, despite their better reporting nowadays (neither Yahoo’s site explorer nor Google’s link stats show you all links!). That means you must maintain your redirects forever, and you must not remove (permanent) redirects. Maintenance of redirects includes hosting abandoned domains, and updates of location directives whenever you change the final structure. With each and every revamp that comes with URL changes check for incoming redirects and make sure that you eliminate unnecessary hops.
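Because chains sneak in over the years, it pays to resolve your redirect map offline before crawlers trip over it. A sketch, assuming you've collected all source→location pairs into one array (the function name is mine):

```php
<?php
// Follow a redirect map to its final destination, counting the hops.
// Returns array(finalUrl, hopCount); a hop count of -1 flags a loop or
// a chain longer than $maxHops.
function resolveRedirects(array $map, $url, $maxHops = 5) {
    $hops = 0;
    $seen = array();
    while (isset($map[$url])) {
        if (isset($seen[$url]) || $hops >= $maxHops) {
            return array($url, -1);
        }
        $seen[$url] = true;
        $url = $map[$url];
        $hops++;
    }
    return array($url, $hops);
}

// Example: an outdated redirect still points at an intermediate URL.
$map = array(
    "/contact.html"    => "/cms/contact.php",
    "/cms/contact.php" => "/blog/contact/",
);
list($final, $hops) = resolveRedirects($map, "/contact.html");
// Two hops to reach /blog/contact/ - a chain worth collapsing into one 301.
```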

Often you’ve many choices where and how to implement a particular redirect. You can do it in scripts and even static HTML files, CMS software, or in the server configuration. There’s no such thing as a general best practice, just a few hints to bear in mind.

  • Doubt: Redirects are dynamite, so blast carefully. Don't believe Web designers and developers when they say that a particular task can't be done without redirects. Do your own research, or ask an SEO expert. When you for example plan to make a static site dynamic by pulling the contents from a database with PHP scripts, you don't need to change your file extensions from *.html to *.php. Apache can parse .html files for PHP, just enable that in your root's .htaccess:
    AddType application/x-httpd-php .html .htm .shtml .txt .rss .xml .css

    Then generate tiny PHP scripts calling the CMS to replace the outdated .html files. That’s not perfect but way better than URL changes, provided your developers can manage the outdated links in the CMS’ navigation. Another pretty popular abuse of redirects is click tracking. You don’t need a redirect script to count clicks in your database, make use of the onclick event instead.
  • Transparency: When the shit hits the fan and you need to track down a redirect with no more than the HTTP header's information in your hands, you'll begin to believe that performance and elegant coding are not everything. Reading and understanding a large httpd.conf file, several complex .htaccess files, and searching redirect routines in a conglomerate of a couple generations of scripts and include files is not exactly fun. You could add a custom field identifying the piece of redirecting code to the HTTP header. In .htaccess that would be achieved with
    Header add X-Redirect-Src "/content/img/.htaccess"

    and in PHP with
    header("X-Redirect-Src: /scripts/inc/header.php", TRUE);

    (Whether or not you should encode or at least obfuscate code locations in headers depends on your security requirements.)
  • Encapsulation: When you must implement redirects in more than one script or include file, then encapsulate all redirects including all the logic (redirect conditions, determining new locations, …). You can do that in an include file with a meaningful file name for example. Also, instead of plastering the root’s .htaccess file with tons of directory/file specific redirect statements, you can gather all requests for redirect candidates and call a script which tests the REQUEST_URI to execute the suitable redirect. In .htaccess put something like:
    RewriteEngine On
    RewriteBase /old-stuff
    RewriteRule ^(.*)\.html$ do-redirects.php

    This code calls /old-stuff/do-redirects.php for each request of an .html file in /old-stuff/. The PHP script:
    $requestUri = $_SERVER["REQUEST_URI"];
    $location = "";
    if (stristr($requestUri, "/contact.html")) {
        $location = "http://sebastians-pamphlets.com/blog/contact/";
    }
    if ($location) {
        @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
        @header("X-Redirect-Src: /old-stuff/do-redirects.php", TRUE);
        @header("Location: $location");
        exit;
    }
    else {
        [output the requested file or whatever]
    }

    (This is also an example of a redirect include file which you could insert at the top of a header.php include or so. In fact, you can include this script in some files and call it from .htaccess without modifications.) This method will not work with ASP on IIS because amateurish wannabe Web servers don’t provide the REQUEST_URI variable.
  • Documentation: When you design or update an information architecture, your documentation should contain a redirect chapter. Also comment all redirects in the source code (your genial regular expressions might lack readability when someone else looks at your code). It’s a good idea to have a documentation file explaining all redirects on the Web server (you might work with other developers when you change your site’s underlying technology in a few years).
  • Maintenance: Debugging legacy code is a nightmare. And yes, what you write today becomes legacy code in a few years. Thus keep it simple and stupid, implement redirects transparent rather than elegant, and don’t forget that you must change your ancient redirects when you revamp a site area which is the target of redirects.
  • Performance: Even when performance is an issue, you can't do everything in httpd.conf. When you for example move a large site changing the URL structure, the redirect logic becomes too complex in most cases. You can't do database lookups and stuff like that in server configuration files. However, some redirects, like for example server name canonicalization, should be performed there, because they're simple and not likely to change. If you can't change httpd.conf, .htaccess files are for you. They're slower than cached config files but still faster than application scripts.

Redirects in server configuration files

Here is an example of a canonicalization redirect in the root’s .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^sebastians-pamphlets\.com [NC]
RewriteRule (.*) http://sebastians-pamphlets.com/$1 [R=301,L]

  1. The first line enables Apache’s mod_rewrite module. Make sure it’s available on your box before you copy, paste and modify the code above.
  2. The second line checks the server name in the HTTP request header (received from a browser, robot, …). The "NC" parameter ensures that the test of the server name (which is, like the scheme part of the URI, not case sensitive by definition) is done as intended. Without this parameter a request of http://SEBASTIANS-PAMPHLETS.COM/ would run into an unnecessary redirect. The rewrite condition returns TRUE when the server name is not sebastians-pamphlets.com. There's an important detail: the negation operator "!" in front of the pattern.

    Most Webmasters do it the other way round. They check whether the server name equals an unwanted server name, for example with RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]. That's not exactly efficient, and it's fault-prone. It's not efficient because one needs to add a rewrite condition for each and every server name a user could type in and the Web server would respond to. On most machines that's a huge list like "www.example.com, mail.example.com, whatever.example.com, …", because the default server configuration catches all not explicitly defined subdomains.

    Of course next to nobody puts that many rewrite conditions into the .htaccess file, hence this method is fault-prone and not suitable to fix canonicalization issues. In combination with thoughtless usage of relative links (bullcrap that most designers and developers love out of laziness and lack of creativity or at least fantasy), one single link to an existing page on a non-existing subdomain not redirected in such an .htaccess file could result in search engines crawling and possibly even indexing a complete site under the unwanted server name. When a savvy competitor spots this exploit you can say good bye to a fair amount of your search engine traffic.

    Another advantage of my single line of code is that you can point all domains you’ve registered to catch type-in traffic or whatever to the same Web space. Every new domain runs into the canonicalization redirect, 100% error-free.

  3. The third line performs the 301 redirect to the requested URI using the canonical server name. That means when the request was for http://www.sebastians-pamphlets.com/about/, the user agent gets redirected to http://sebastians-pamphlets.com/about/. The "R" parameter sets the response code, and the "L" parameter means "last rule" (=exit), that is, the statements following the redirect execution, like other rewrite rules and such stuff, will not be parsed.

If you've access to your server's httpd.conf file (which most hosting services don't allow), then better do such redirects there. The reason for this recommendation is that Apache must look for .htaccess directives in the current directory and all its upper levels for each and every requested file. If the request is for a page with lots of embedded images or other objects, that sums up to hundreds of hard disk accesses slowing down the page loading time. The server configuration on the other hand is cached and therefore way faster. Learn more about .htaccess disadvantages. However, since most Webmasters can't modify their server configuration, I provide .htaccess examples only. If you can, then you know how to put it in httpd.conf. ;)
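For illustration, the canonicalization redirect could look like this in httpd.conf, using two name-based virtual hosts instead of mod_rewrite (document root and port are assumptions, adapt them to your setup):

```apache
# Redirecting vhost, listed first: it also acts as Apache's default vhost,
# so any unlisted server name pointing at this box runs into the 301 too.
<VirtualHost *:80>
    ServerName www.sebastians-pamphlets.com
    ServerAlias *.sebastians-pamphlets.com
    Redirect permanent / http://sebastians-pamphlets.com/
</VirtualHost>

# Canonical vhost serving the content.
<VirtualHost *:80>
    ServerName sebastians-pamphlets.com
    DocumentRoot /var/www/pamphlets
</VirtualHost>
```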

Redirecting directories and files with .htaccess

When you need to redirect chunks of static pages to another location, the easiest way to do that is Apache's Redirect directive. The basic syntax is Redirect [301|302|307] Path URL, e.g. Redirect 307 /blog/feed http://feeds.example.com/myblog or Redirect 301 /contact.htm /blog/contact/. Path is always a file system path relative to the Web space's root. URL is either a fully qualified URL (on another machine) like http://example.org/blog/contact/, or a relative URL on the same server like /blog/contact/ (Apache adds scheme and server in this case, so that the HTTP header is built with an absolute URL in the location field; however, omitting the scheme+server part of the target URL is not recommended, see the warning below).

When you for example want to consolidate a blog hosted on its own subdomain (blog.example.com) and a corporate Web site at www.example.com, then put
Redirect 301 /blog http://blog.example.com

in the .htaccess file of www.example.com. When you then request http://www.example.com/blog/contact/, you're redirected to http://blog.example.com/contact/.

Say you've moved your product pages from /products/*.htm to /shop/products/*.htm, then put
Redirect 301 /products http://www.example.com/shop/products

Omit the trailing slashes when you redirect directories. To redirect particular files on the other hand you must fully qualify the locations:
Redirect 302 /misc/contact.html http://www.example.com/cms/contact.php

or, when the new location resides on the same server:
Redirect 301 /misc/contact.html /cms/contact.php

Warning: Although Apache allows local redirects like Redirect 301 /misc/contact.html /cms/contact.php, with some server configurations this will result in 500 server errors on all requests. Therefore I recommend the use of fully qualified URLs as redirect targets, e.g. Redirect 301 /misc/contact.html http://www.example.com/cms/contact.php!

Maybe you found a reliable and unbeatably cheap hosting service to host your images. Copy all image files from www.example.com to images.example.com and keep the directory structures as well as all file names. Then add to www.example.com's .htaccess
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://images.example.com$1.$2

The regex should match e.g. /img/nav/arrow-left.png so that the user agent is forced to request http://images.example.com/img/nav/arrow-left.png. Say you've converted your GIFs and JPGs to the PNG format during this move, simply change the redirect statement to
RedirectMatch 301 (.*)\.([Gg][Ii][Ff]|[Pp][Nn][Gg]|[Jj][Pp][Gg])$ http://images.example.com$1.png

With regular expressions and RedirectMatch you can perform very creative redirects.
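As a made-up example, say a blog's dated archive moves into a new hierarchy; the captured groups carry the date and slug over (domain and paths are placeholders):

```apache
# /blog/2007/10/some-post.html -> http://example.com/archive/2007-10/some-post.html
RedirectMatch 301 ^/blog/([0-9]{4})/([0-9]{2})/(.*)$ http://example.com/archive/$1-$2/$3
```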

Please note that the response codes used in the code examples above most probably do not fit the type of redirect you’d do in real life with similar scenarios. I’ll discuss use cases for all redirect response codes (301|302|307) later on.

Redirects in server sided scripts

You can do HTTP redirects only with server sided programming languages like PHP, ASP, Perl etcetera. Scripts in those languages generate the output before anything is sent to the user agent. It should be a no-brainer, but these PHP examples don't count as server sided redirects:
print "<meta http-equiv=\"refresh\" content=\"0; url=http://example.com/\">\n";
print "<script type=\"text/javascript\">window.location = \"http://example.com/\";</script>\n";

Just because you can output a redirect with a server sided language that does not make the redirect an HTTP redirect. ;)

In PHP you perform HTTP redirects with the header() function:
$newLocation = "http://example.com/new-location/";
@header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
@header("Location: $newLocation");

The first input parameter of header() is the complete header line, in the first line of code above that's the status-line. The second parameter tells whether a previously sent header line shall be replaced (default behavior) or not. The third parameter sets the HTTP status code, don't use it more than once. If you use an ancient PHP version (prior to 4.3.0) you can't pass the 2nd and 3rd input parameters. The "@" suppresses PHP warnings and error messages.
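To avoid retyping those lines (and to keep the absolute-URL rule enforced in one place), you can wrap them. A sketch with the header composition split out so it stays testable; the function names are mine:

```php
<?php
// Compose the header lines for a permanent redirect.
// Returns FALSE for locations that aren't fully qualified URLs.
function buildPermanentRedirect($location) {
    $parts = parse_url($location);
    if (empty($parts["scheme"]) || empty($parts["host"])) {
        return FALSE; // relative URL - must not go into a Location header
    }
    return array(
        "HTTP/1.1 301 Moved Permanently",
        "Location: $location",
    );
}

// Thin wrapper actually sending the headers and stopping script execution.
function redirectPermanently($location) {
    $lines = buildPermanentRedirect($location);
    if ($lines === FALSE) {
        return FALSE;
    }
    @header($lines[0], TRUE, 301);
    @header($lines[1], TRUE);
    exit;
}
```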

With ColdFusion you code
<CFHEADER statuscode="307" statustext="Temporary Redirect">
<CFHEADER name="Location" value="http://example.com/new-location/">

A redirecting Perl script begins with
#!/usr/bin/perl -w
use strict;
print "Status: 302 Found Elsewhere\r\n", "Location:\r\n\r\n";

Even with ASP you can do server sided redirects. VBScript:
Dim newLocation
newLocation = "http://example.com/new-location/"
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", newLocation

Or in JScript:
function RedirectPermanent(newLocation) {
    Response.Buffer = true;
    Response.Status = "301 Moved Permanently";
    Response.AddHeader("Location", newLocation);
    Response.End();
}
RedirectPermanent("http://example.com/new-location/");

Again, if you suffer from IIS/ASP maladies: here you go.

Remember: Don’t output anything before the redirect header, and nothing after the redirect header!

Redirects done by the Web server itself

When you read your raw server logs, you'll find a few 302 and/or 301 redirects Apache has performed without an explicit redirect statement in the server configuration, .htaccess, or a script. Most of these automatic redirects are the result of a very popular bullshit practice: removing trailing slashes. Although the standard defines that an URI like /directory is not a file name by default, and therefore equals /directory/ if there's no file named /directory, choosing the version without the trailing slash is lazy at least, and creates lots of trouble (404s in some cases, otherwise external redirects, and always duplicate content issues you should fix with URL canonicalization routines).

For example Yahoo is a big fan of truncated URLs. They might save a few terabytes in their indexes by storing URLs without the trailing slash, but they send every user’s browser twice to those locations. Web servers must do a 302 or 301 redirect on each Yahoo-referrer requesting a directory or pseudo-directory, because they can’t serve the default document of an omitted path segment (the path component of an URI begins with a slash, the slash is its segment delimiter, and a trailing slash stands for the last (or only) segment representing a default document like index.html). From the Web server’s perspective /directory does not equal /directory/, only /directory/ addresses /directory/index.(htm|html|shtml|php|...), whereby the file name of the default document must be omitted (among other things to preserve the URL structure when the underlying technology changes). Also, the requested URI without its trailing slash may address a file or an on the fly output (if you make use of mod_rewrite to mask ugly URLs you better test what happens with screwed URIs of yours).

Yahoo wastes even their own resources. Their crawler persistently requests the shortened URL, what bounces with a redirect to the canonical URL. Here is an example from my raw logs: - - [05/Oct/2007:01:13:04 -0400] "GET /directory HTTP/1.0″ 301 26 “-” “Mozilla/5.0 (compatible; Yahoo! Slurp;” - - [05/Oct/2007:01:13:06 -0400] “GET /directory/ HTTP/1.0″ 200 8642 “-” “Mozilla/5.0 (compatible; Yahoo! Slurp;”
[I’ve replaced a rather long path with “directory”]

If you persistently redirect Yahoo to the canonical URLs (with trailing slash), they’ll use your canonical URLs on the SERPs eventually (but their crawler still requests Yahoo-generated crap). Having many good inbound links as well as clean internal links –all with the trailing slash– helps too, but is not a guarantee for canonical URL normalization at Yahoo.

Here is an example. This URL responds with 200-OK, regardless whether it’s requested with or without the canonical trailing slash:
(That's the default (mis)behavior of everybody's darling with permalinks by the way. Here is some PHP canonicalization code to fix this flaw.) All internal links use the canonical URL. I didn't find a serious inbound link pointing to a truncated version of this URL. Yahoo's Site Explorer lists the URL without the trailing slash: […]/im-confused, and the same happens on Yahoo's SERPs: […]/im-confused. Even when a server responds 200-OK to two different URLs, a serious search engine should normalize according to the internal links as well as the entry in the XML sitemap, and therefore choose the URL with the trailing slash as the canonical URL.
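The decision behind such a canonicalization routine boils down to one test: does a path that addresses a (pseudo-)directory lack its trailing slash? A sketch with the logic kept as a pure function (the names and the file-extension heuristic are mine):

```php
<?php
// Return the canonical form of a request URI, or NULL when it's fine as-is.
// Paths whose last segment has no file extension are treated as directories.
function canonicalPath($requestUri) {
    $path = parse_url($requestUri, PHP_URL_PATH);
    $query = parse_url($requestUri, PHP_URL_QUERY);
    $lastSegment = basename($path);
    if (substr($path, -1) !== "/" && strpos($lastSegment, ".") === FALSE) {
        return $path . "/" . ($query ? "?$query" : "");
    }
    return NULL;
}

// Usage: 301 to the canonical URL before any content is sent.
if (isset($_SERVER["REQUEST_URI"])) {
    $canonical = canonicalPath($_SERVER["REQUEST_URI"]);
    if ($canonical !== NULL) {
        @header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
        @header("Location: http://sebastians-pamphlets.com" . $canonical);
        exit;
    }
}
```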

Fucking up links on search result pages is evil enough, although fortunately this crap doesn’t influence discovery crawling directly because those aren’t crawled by other search engines (but scraped or syndicated search results are crawlable). Actually, that’s not the whole horror story. Other Yahoo properties remove the trailing slashes from directory and home page links too (look at the “What Readers Viewed” column in your MBL stats for example), and some of those services provide crawlable pages carrying invalid links (pulled from the search index or screwed otherwise). That means other search engines pick those incomplete URLs from Yahoo’s pages (or other pages with links copied from Yahoo pages), crawl them, and end up with search indexes blown up with duplicate content. Maybe Yahoo does all that only to burn Google’s resources by keeping their canonicalization routines and duplicate content filters busy, but it’s not exactly gentlemanlike that such cat fights affect all Webmasters across the globe. Yahoo directly as well as indirectly burns our resources with unnecessary requests of screwed URLs, and we must implement sanitizing redirects for software like WordPress –which doesn’t care enough about URL canonicalization–, just because Yahoo manipulates our URLs to peeve Google. Doh!

If somebody from Yahoo (or MSN, or any other site manipulating URLs this way) reads my rant, I highly recommend this quote from Tim Berners-Lee (January 2005):

Scheme-Based Normalization
[…] the following […] URIs are equivalent:
In general, an URI that uses the generic syntax for authority with an empty path should be normalized to a path of “/”.
Normalization should not remove delimiters [”/” or “?”] when their associated component is empty unless licensed to do so by the scheme specification. [emphasis mine]

In my book sentences like “Note that the absolute path cannot be empty; if none is present in the original URI, it MUST be given as ‘/’ […]” in the HTTP specification as well as Section 3.3 of the URI’s Path Segment specs do not sound like a licence to screw URLs. Omitting the path segment delimiter “/” representing an empty last path segment might sound legal if the specs are interpreted without applying common sense, but knowing that Web servers can’t respond to requests of those incomplete URIs and nevertheless truncating trailing slashes is a brain dead approach (actually, such crap deserves a couple unprintable adjectives).

Frequently scanning the raw logs for 302/301 redirects is a good idea. Also, implement documented canonicalization redirects when a piece of software responds to different versions of URLs. It's the Webmaster's responsibility to ensure that each piece of content is available under one and only one URL. You cannot rely on any search engine's URL canonicalization, because shit happens, even with highly sophisticated algos:

When search engines crawl identical content through varied URLs, there may be several negative effects:

1. Having multiple URLs can dilute link popularity. For example, in the diagram above [example in Google’s blog post], rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs.

2. Search results may display user-unfriendly URLs […]

Redirect or not? A few use cases.

Before I blather about the three redirect response codes you can choose from, I’d like to talk about a few situations where you shall not redirect, and cases where you probably don’t redirect but should do so.

Unfortunately, it’s a common practice to replace various sorts of clean links with redirects. Whilst legions of Webmasters don’t obfuscate their affiliate links, they hide their valuable outgoing links in fear of PageRank leaks and other myths, or react to search engine FUD with castrated links.

With very few exceptions, the A Element a.k.a. Hyperlink is the best method to transport link juice (PageRank, topical relevancy, trust, reputation …) as well as human traffic. Don’t abuse my beloved A Element:
<a onclick="window.location = 'http://example.com/'; return false;" title="Bad example">bad example</a>

Such a “link” will transport some visitors, but does not work when JavaScript is disabled or the user agent is a Web robot. This “link” is not an iota better:
<a href="" title="Another bad example">example</a>

Simplicity pays. You don’t need the complexity of HREF values changed to ugly URLs of redirect scripts with parameters, located in an uncrawlable path, just because you don’t want that search engines count the links. Not to speak of cases where redirecting links is unfair or even risky, for example click tracking scripts which do a redirect.

  • If you need to track outgoing traffic, then by all means do it in a search engine friendly way with clean URLs which benefit the link destination and don’t do you any harm, here is a proven method.
  • If you really can’t vouch for a link, for example because you link out to a so called bad neighborhood (whatever that means), or to a link broker, or to someone who paid for the link and Google can detect it or a competitor can turn you in, then add rel=”nofollow” to the link. Yeah, rel-nofollow is crap … but it’s there, it works, we won’t get something better, and it’s less complex than redirects, so just apply it to your fishy links as well as to unmoderated user input.
  • If you decide that an outgoing link adds value for your visitors, and you personally think that the linked page is a great resource, then almost certainly search engines will endorse the link (regardless whether it shows a toolbar PR or not). There’s way too much FUD and crappy advice out there.
  • You really don't lose PageRank when you link out. Honestly gained PageRank sticks at your pages. You only lower the amount of PageRank you can pass to your internal links a little. That's not a bad thing, because linking out to great stuff can bring in more PageRank in the form of natural inbound links (there are other advantages too). Also, Google dislikes PageRank hoarding and the unnatural link patterns you create with practices like that.
  • Every redirect slows things down, and chances are that a user agent messes with the redirect, which can result in rendering nothing, scrambled stuff, or something completely unrelated. I admit that's not a very common problem, but it happens with some outdated though still used browsers. Avoid redirects where you can.

In some cases you should perform redirects for sheer search engine compliance, in other words selfish SEO purposes. For example don’t let search engines handle your affiliate links.

  • If you operate an affiliate program, then internally redirect all incoming affiliate links to consolidate your landing page URLs. Although incoming affiliate links don’t bring much link juice, every little helps when it lands on a page which doesn’t credit search engine traffic to an affiliate.
  • Search engines are pretty smart when it comes to identifying affiliate links. (Thin) affiliate sites suffer from decreasing search engine traffic. Fortunately, the engines respect robots.txt, that means they usually don’t follow links via blocked subdirectories. When you link to your merchants within the content, using URLs that don’t smell like affiliate links, it’s harder to detect the intention of those links algorithmically. Of course that doesn’t prevent you from smart algos trained to spot other patterns, and this method will not pass reviews by humans, but it’s worth a try.
  • If you've pages which change their contents often by featuring for example a product of the day, you might have a redirect candidate. Instead of duplicating a daily changing product page, you can do a dynamic soft redirect to the product pages. Whether a 302 or a 307 redirect is the best choice depends on the individual circumstances. However, you can promote the hell out of the redirecting page, so that it gains all the search engine love without passing on PageRank etc. to product pages which phase out after a while. (If the product page is hosted by the merchant you must use a 307 response code. Otherwise make sure the 302′ing URL is listed in your XML sitemap with a high priority. If you can, send a 302 with most HTTP/1.0 requests, and a 307 responding to HTTP/1.1 requests. See the 302/307 sections for more information.)
  • If an URL comes with a session-ID or another tracking variable in its query string, you must 301-redirect search engine crawlers to an URI without such randomly generated noise. There’s no need to redirect a human visitor, but search engines hate tracking variables so just don’t let them fetch such URLs.
  • There are other use cases involving creative redirects which I’m not willing to discuss here.
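The session-ID case from the list above can be handled with a small filter. A sketch assuming the noise variables are named sid, PHPSESSID etcetera; adapt the list to whatever your software appends:

```php
<?php
// Strip known tracking/session variables from a request URI.
// Returns the cleaned URI, or NULL when there was nothing to remove.
function stripTrackingVars($requestUri, $noise = array("sid", "PHPSESSID", "sessionid")) {
    $path = parse_url($requestUri, PHP_URL_PATH);
    $query = parse_url($requestUri, PHP_URL_QUERY);
    if (!$query) {
        return NULL;
    }
    parse_str($query, $vars);
    $clean = array_diff_key($vars, array_flip($noise));
    if (count($clean) === count($vars)) {
        return NULL; // no noise found, don't redirect
    }
    $cleanQuery = http_build_query($clean);
    return $path . ($cleanQuery ? "?$cleanQuery" : "");
}

// Usage: 301 crawlers (or everybody) to the clean URL when it differs:
// if (($clean = stripTrackingVars($_SERVER["REQUEST_URI"])) !== NULL) { /* 301 */ }
```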

Of course both lists above aren’t complete.

Choosing the best redirect response code (301, 302, or 307)

I'm sick of articles like "search engine friendly 301 redirects" propagating that only permanent redirects work with search engines. That's a lie. I read those misleading headlines daily on the webmaster boards, in my feed reader, at Sphinn, and elsewhere … and I'm not amused. Lemmings. Amateurish copycats. Clueless plagiarists. [Insert a few lines of somewhat offensive language and swearing ;) ]

Of course most redirects out there return the wrong response code. That's because the default HTTP response code for all redirects is 302, and many code monkeys forget to send a status-line providing the 301 Moved Permanently when an URL was actually moved or the requested URI is not the canonical URL. When a clueless coder or hosting service invokes a Location: header statement without a previous HTTP/1.1 301 Moved Permanently status-line, the redirect becomes a soft 302 Found. That does not mean that 302 or 307 redirects aren't search engine friendly at all. All HTTP redirects can be safely used with regard to search engines. The point is that one must choose the correct response code based on the actual circumstances and goals. Blindly 301′ing everything is counterproductive sometimes.

301 - Moved Permanently

The message of a 301 response code to the requestor is: "The requested URI has vanished. It's gone forever and perhaps it never existed. I will never supply any contents under this URI (again). Request the URL given in location, and replace the outdated respectively wrong URL in your bookmarks/records by the new one for future requests. Don't bother me again. Farewell."

Lets start with the definition of a 301 redirect quoted from the HTTP/1.1 specifications:

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs [(1)]. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.

The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s). […]

Read a polite “SHOULD” as “must”.

(1) Although technically you could provide more than one location, you must not do that because it irritates too many user agents, search engine crawlers included.

Make use of the 301 redirect when a requested Web resource was moved to another location, or when a user agent requests an URI which is definitely wrong and you’re able to tell the correct URI with no doubt. For URL canonicalization purposes (more info here) the 301 redirect is your one and only friend.

You must not recycle any 301′ing URLs, that means once an URL responds with 301 you must stick with it, you can’t reuse this URL for other purposes next year or so.

Also, you must maintain the 301 response and a location corresponding to the redirecting URL forever. That does not mean that the location can’t be changed. Say you’ve moved a contact page /contact.html to a CMS where it resides under /cms/contact.php. If a user agent requests /contact.html it does a 301 redirect pointing to /cms/contact.php. Two years later you change your software again, and the contact page moves to /blog/contact/. In this case you must change the initial redirect, and create a new one:
/contact.html 301-redirects to /blog/contact/, and
/cms/contact.php 301-redirects to /blog/contact/.
If you keep the initial redirect /contact.html to /cms/contact.php, and redirect /cms/contact.php to /blog/contact/, you create a redirect chain which can deindex your content at search engines. Well, two redirects before a crawler reaches the final URL shouldn’t be a big deal, but add a canonicalization redirect fixing a www vs. non-www issue to the chain, and imagine a crawler comes from a directory or links list which counts clicks with a redirect script, you’ve four redirects in a row. That’s too much, most probably all search engines will not index such an unreliable Web resource.
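Expressed as data, the rule is that every historical URL maps straight to the final location; no target may itself be a redirecting URL. A sketch of the map after the second move, plus a sanity check you could run on deployment (the function name is mine):

```php
<?php
// Both historical contact URLs point directly at the final destination.
$redirects = array(
    "/contact.html"    => "/blog/contact/",
    "/cms/contact.php" => "/blog/contact/",
);

// Detect accidental chains: a chain exists when a redirect target appears
// as a source elsewhere in the map.
function hasChains(array $redirects) {
    foreach ($redirects as $source => $target) {
        $targetPath = parse_url($target, PHP_URL_PATH);
        if (isset($redirects[$targetPath])) {
            return TRUE;
        }
    }
    return FALSE;
}
```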

301 redirects transfer search engine love like PageRank gathered by the redirecting URL to the new location, but the search engines keep the old URL in their indexes, and revisit it every now and then to check whether the 301 redirect is stable or not. If the redirect is gone on the next crawl, the new URL loses the reputation earned from the redirect’s inbound links. It’s impossible to get all inbound links changed, hence don’t delete redirects after a move.

It’s a good idea to check your 404 logs weekly or so, because search engine crawlers pick up malformed links from URL drops and such. Even when the link is invalid, for example because a crappy forum software has shortened the URL, it’s an asset you should not waste with a 404 or even 410 response. Find the best matching existing URL and do a 301 redirect.
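Finding the best matching existing URL for a malformed request can be as crude as an edit-distance comparison over your known URLs. A rough sketch; the distance threshold is a guess, tune it against your own 404 logs:

```php
<?php
// Return the known URL closest to a 404ing path, or NULL when even the
// best candidate is too far off to be a safe 301 target.
function bestMatchingUrl($requestPath, array $existingUrls, $maxDistance = 10) {
    $best = NULL;
    $bestDistance = $maxDistance + 1;
    foreach ($existingUrls as $url) {
        $distance = levenshtein($requestPath, $url);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $url;
        }
    }
    return $best;
}

$existing = array("/blog/contact/", "/blog/about/", "/blog/archives/");
// A crappy forum software truncated the URL:
echo bestMatchingUrl("/blog/contac", $existing); // /blog/contact/
```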

Here is what Google says about 301 redirects:

[Source] 301 (Moved permanently) […] You should use this code to let Googlebot know that a page or site has permanently moved to a new location. […]

[Source …] If you’ve restructured your site, use 301 redirects (”RedirectPermanent”) in your .htaccess file to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.) […]

[Source …] If your old URLs redirect to your new site using HTTP 301 (permanent) redirects, our crawler will discover the new URLs. […] Google listings are based in part on our ability to find you from links on other sites. To preserve your rank, you’ll want to tell others who link to you of your change of address. […]

[Source …] If your site [or page] is appearing as two different listings in our search results, we suggest consolidating these listings so we can more accurately determine your site’s [page’s] PageRank. The easiest way to do so [on site level] is to set the preferred domain using our webmaster tools. You can also redirect one version [page] to the other [canonical URL] using a 301 redirect. This should resolve the situation after our crawler discovers the change. […]

That’s exactly what the HTTP standard wants a search engine to do. Yahoo handles 301 redirects a little differently:

[Source …] When one web page redirects to another web page, Yahoo! Web Search sometimes indexes the page content under the URL of the entry or “source” page, and sometimes index it under the URL of the final, destination, or “target” page. […]

When a page in one domain redirects to a page in another domain, Yahoo! records the “target” URL. […]

When a top-level page [] in a domain presents a permanent redirect to a page deep within the same domain, Yahoo! indexes the “source” URL. […]

When a page deep within a domain presents a permanent redirect to a page deep within the same domain, Yahoo! indexes the “target” URL. […]

Because of mapping algorithms directing content extraction, Yahoo! Web Search is not always able to discard URLs that have been seen as 301s, so web servers might still see crawler traffic to the pages that have been permanently redirected. […]

As for the non-standard procedure to handle redirecting root index pages: that’s not a big deal, because in most cases a site owner promotes the top-level page anyway. Actually, that’s a smart way to “break the rules” for the better. Far more annoying are the way too many repeated requests of permanently redirecting pages.

Moving sites with 301 redirects

When you restructure a site, consolidate sites or separate sections, move to another domain, flee from a free host, or make other structural changes, then in theory you can install page-by-page 301 redirects and you’re done. Actually, that works, but it comes with disadvantages like a total loss of all search engine traffic for a while. The larger the site, the longer the while. For a large site highly dependent on SERP referrers, this procedure can be the first phase of a bankruptcy filing, because none of the search engines send (much) traffic during the move.

Let’s look at the process from a search engine’s perspective. All of a sudden, the crawling bounces at 301 redirects, and none of the redirect targets is known to the search engine. The crawlers report back the redirect responses as well as the new URLs. The indexers spotting the redirects block the redirecting URLs for the query engine, but can’t pass the properties (PageRank, contextual signals and so on) of the redirecting resources to the new URLs, because those aren’t crawled yet.

The crawl scheduler initiates the handshake with the newly discovered server to estimate its robustness, and most probably makes a conservative guess of the crawl frequency this server can sustain. The queue of uncrawled URLs belonging to the new server grows way faster than the crawlers actually deliver the first contents fetched from the new server.

Each and every URL fetched from the old server vanishes from the SERPs in no time, whilst the new URLs aren’t crawled yet, or are still waiting for an idle indexer able to assign them the properties of the old URLs, doing heuristic checks on the stored contents from both URLs and whatnot.

Slowly, sometimes weeks after the begin of the move, the first URLs from the new server populate the SERPs. They don’t rank very well, because the search engine has not yet discovered the new site’s structure and linkage completely, so that a couple of ranking factors stay temporarily unconsidered. Some of the new URLs may appear as URL-only listings, solely indexed based on off-page factors, hence lacking the ability to trigger search query relevance for their contents.

Many of the new URLs can’t regain their former PageRank in the first reindexing cycle, because without a complete survey of the “new” site’s linkage there’s only the PageRank from external inbound links passed by the redirects available (internal links no longer count for PageRank when the search engine discovers that the source of internally distributed PageRank does a redirect), so that they land in a secondary index.

Next, the suddenly lower PageRank results in a lower crawling frequency for the URLs in question. Also, the process removing redirecting URLs still runs way faster than the reindexing of moved contents from the new server. The more URLs are involved in a move, the longer the reindexing and reranking lasts. Replace Google’s very own PageRank with any other engine’s term and you’ve got a somewhat usable description of a site move handled by Yahoo, MSN, or Ask. There are only so many ways to handle such a challenge.

That’s a horror scenario, isn’t it? Well, at Google the recently changed infrastructure has greatly improved this process, and other search engines evolve too, but moves as well as significant structural changes will always result in periods of decreased SERP referrers, or even no search engine traffic at all.

Does that mean that big moves are too risky, or even not doable? Not at all. You just need deep pockets. If you lack a budget to feed the site with PPC or other bought traffic to compensate an estimated loss of organic traffic lasting at least a few weeks, but perhaps months, then don’t move. And when you move, then set up a professionally managed project, and hire experts for this task.

Here are some guidelines. I don’t provide a timeline, because that’s impossible without detailed knowledge of the individual circumstances. Adapt the procedure to fit your needs, nothing’s set in stone.

  • Set up the site on the new Web server ( In robots.txt, block everything except a temporary page telling visitors that this server is the new home of your site. Link to this page to get search engines familiar with the new server, but make sure there are no links to blocked content yet.
  • Create mapping tables “old URL to new URL” (respectively algos) to prepare the 301 redirects etcetera. You could consolidate multiple pages under one redirect target and so on, but you better wait with changes like that. Do them after the move. When you keep the old site’s structure on the new server, you make the job easier for search engines.
  • If you plan to do structural changes after the move, then develop the redirects in a way that you can easily change the redirect targets on the old site, and prepare the internal redirects on the new site as well. In any case, your redirect routines must be able to redirect or not depending on parameters like site area, user agent / requestor IP and such stuff, and you need a flexible control panel as well as URL specific crawler auditing on both servers.
  • On develop a server-sided procedure which can add links to the new location to every page on your old domain. Identify your URLs with the lowest crawling frequency. Work out a timetable for the move which considers page importance (with regard to search engine traffic) and crawl frequency.
  • Remove the Disallow: statements in the new server’s robots.txt. Create one or more XML sitemap(s) for the new server and make sure that you set crawl priority and change frequency accurately, and that last-modified gets populated with the scheduled begin of the move (IOW the day the first search engine crawler can access the sitemap). Feed the engines with sitemap files listing the important URLs first. Add sitemap autodiscovery statements to robots.txt, and manually submit the sitemaps to Google and Yahoo.
  • Fire up the scripts creating visible “this page will move to [new location] soon” links on the old pages. Monitor the crawlers on the new server. Don’t worry about duplicate content issues in this phase, “move” in the anchor text is a magic word. Do nothing until the crawlers have fetched at least the first and second link level on the new server, as well as most of the important pages.
  • Briefly explain your redirect strategy in robots.txt comments on both servers. If you can, add explanatory HTML comments to the HEAD section of all pages on the old server. You will cloak for a while, and things like that can help to pass reviews by humans who might get an alert from an algo or a spam report. It’s more or less impossible to redirect human traffic in chunks, because that results in annoying surfing experiences, inconsistent database updates, and other disadvantages. Search engines aren’t cruel and understand that.
  • 301 redirect all human traffic to the new server. Serve search engines the first chunk of redirecting pages. Start with a small chunk of not more than 1,000 pages or so, and bundle related pages to preserve most of the internal links within each chunk.
  • Closely monitor the crawling and indexing process of the first chunk, and don’t release the next one before it has (nearly) finished. Probably it’s necessary to handle each crawler individually.
  • Whilst you release chunk after chunk of redirects to the engines adjusting the intervals based on your experiences, contact all sites linking to you and ask for URL updates (bear in mind to delay these requests for inbound links pointing to URLs you’ll change after the move for other reasons). It helps when you offer an incentive, best let your marketing dept. handle this task (having a valid reason to get in touch with those Webmasters might open some opportunities).
  • Support the discovery crawling based on redirects and updated inbound links by releasing more and more XML sitemaps on the new server. Enabling sitemap based crawling should somewhat correlate to your release of redirect chunks. Both discovery crawling and submission based crawling share the bandwidth, respectively the amount of daily fetches, the crawling engine has determined for your new server. Hence don’t disturb the balance by submitting sitemaps listing 200,000 unimportant 5th level URLs whilst a crawler processes a chunk of landing pages promoting your best selling products. You can steer sitemap autodiscovery depending on the user agent (for MSN and Ask, which don’t offer submit forms) in your robots.txt, in combination with submissions to Google and Yahoo. Don’t forget to maintain (delete or update frequently) the sitemaps after the move.
  • Make sure you can control your redirects forever. Pay the hosting service and the registrar of the old site for the next ten years upfront. ;)
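The chunked release described above can be sketched in a few lines. Everything here is hypothetical (the paths, the chunk numbers, the naive user-agent sniffing); a production version would verify crawler IPs as well, as the guidelines demand. Humans always get the 301, crawlers only once the requested page’s chunk has been released:

```python
# Simplified UA sniffing; real code must verify requestor IPs too.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")

# Made-up chunk assignments, and the highest chunk released to crawlers.
CHUNK_OF_PATH = {"/products/widget": 1, "/archive/1999/foo": 7}
RELEASED_UP_TO = 1

def response_for(path, user_agent, new_host="www.example.com"):
    """Return (status, location or None) for a request to the old server."""
    is_crawler = any(t in user_agent.lower() for t in CRAWLER_TOKENS)
    released = CHUNK_OF_PATH.get(path, 0) <= RELEASED_UP_TO
    if not is_crawler or released:
        return 301, "http://%s%s" % (new_host, path)
    return 200, None  # crawler, chunk not released yet: serve the old page
```

A human hitting an unreleased page is 301’d right away, while Ms. Googlebot keeps getting the old page (with its visible “this page will move” link) until its chunk is due.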

Of course there’s no such thing as a bullet-proof procedure to move large sites, but you can do a lot to make the move as smooth as possible.

302 - Found [Elsewhere]

The 302 redirect, like the 303/307 response codes, is kind of a soft redirect. Whilst a 301 redirect indicates a hard redirect by telling the user agent that a requested address is outdated (should be deleted) and the resource must be requested under another URL, 302 (303/307) redirects can be used with URLs which are valid, and should be kept by the requestor, but don’t deliver content at the time of the request. In theory, a 302′ing URL could redirect to another URL with each and every request, and even serve contents itself every now and then.

Whilst that’s no big deal with user agents used by humans (browsers, screen readers), search engines, which crawl and index contents by following paths that must be accessible for human surfers, consider soft redirects unreliable by design. What makes indexing soft redirects a royal PITA is the fact that most soft redirects actually are meant to notify of a permanent move. 302 is the default response code for all redirects, and setting the correct status code is not exactly popular in developer crowds, so gazillions of 302 redirects are syntax errors which mimic 301 redirects.
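That accidental-302 trap is easy to demonstrate. The sketch below (a minimal WSGI handler, with a made-up target URL) spells the 301 out explicitly; leaving the status decision to a framework’s default redirect helper is exactly how those unintended 302s come into the world:

```python
def redirect_app(environ, start_response):
    """Minimal WSGI app sending an explicit 301 (target URL is made up)."""
    target = "http://www.example.com/new-url"
    # Spell out the status line: a bare Location header left to default
    # behavior is what produces the gazillions of accidental 302s.
    start_response("301 Moved Permanently", [("Location", target)])
    return [b""]
```

The same discipline applies to PHP and Apache: always emit a status line alongside the location directive.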

Search engines have no other chance than requesting those wrongly redirecting URLs over and over to persistently check whether the soft redirect’s functionality sticks with the implied behavior of a permanent redirect.

Also, way back when search engines interpreted soft redirects according to the HTTP standards, it was possible to hijack foreign resources with a 302 redirect and even meta refreshes. That means that a strong (high PageRank) URL 302-redirecting to a weaker (lower PageRank) URL on another server got listed on the SERPs with the contents pulled from the weak page. Since Internet marketers are smart folks, this behavior enabled creative content delivery: of course only crawlers saw the redirect, humans got a nice sales pitch.

With regard to search engines, 302 redirects should be applied very carefully, because ignorant developers and, well, questionable intentions have forced the engines to handle 302 redirects in a way that’s not exactly compliant with Web standards, but meant to be the best procedure to serve a searcher’s interests. When you do cross-domain 302s, you can’t predict whether search engines pick the source, the target, or even a completely different but nice looking URL from the target domain for their SERPs. In most cases the target URL of a 302 redirect gets indexed, but according to Murphy’s law and experience of life, “99%” leaves enough room for serious mess-ups.

Partly the common 302 confusion is based on the HTTP standard(s). With regard to SEO, response codes usable with GET and HEAD requests are more important, so I simplify things by ignoring issues with POST requests. Let’s compare the definitions:

HTTP/1.0: 302 Moved Temporarily

The requested resource resides temporarily under a different URL. Since the redirection may be altered on occasion, the client should continue to use the Request-URI for future requests.

The URL must be given by the Location field in the response. Unless it was a HEAD request, the Entity-Body of the response should contain a short note with a hyperlink to the new URI(s).

HTTP/1.1: 302 Found

The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

The temporary URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).

First, there’s a changed reason phrase for the 302 response code. “Moved Temporarily” became “Found” (”Found Elsewhere”), and a new response code 307 labelled “Temporary Redirect” was introduced (the other new response code 303 “See Other” is for POST results redirecting to a resource which requires a GET request).

Creatively interpreted, this change could indicate that we should replace 302 redirects applied to temporarily moved URLs with 307 redirects, reserving the 302 response code for hiccups and redirects done by the Web server itself –without an explicit redirect statement in the server’s configuration (httpd.conf or .htaccess)–, for example in response to requests of maliciously shortened URIs (of course a 301 is the right answer in this case, but some servers use the “wrong” 302 response code by default to err on the side of caution until the Webmaster sets proper canonicalization redirects returning 301 response codes).

Strictly interpreted, this change tells us that the 302 response code must not be applied to moved URLs, regardless of whether the move is really a temporary replacement (during maintenance windows, to point to mirrors of pages on overcrowded servers during traffic spikes, …) or even a permanent forwarding request where somebody didn’t bother sending a status line to qualify the location directive. As for maintenance, better use 503 “Service Unavailable”!

Another important change is the addition of the non-cacheable instruction in HTTP/1.1. Because the HTTP/1.0 standard didn’t explicitly state that the URL given in the location field must not be cached, some user agents did cache it, and the few Web developers actually reading the specs thought they were allowed to simplify their various redirects (302′ing everything), because in the eyes of a developer nothing is really there to stay (SEOs, who handle URLs as assets, often don’t understand this philosophy, and thus sadly act confrontational instead of educational).

Having said all that, is there still a valid use case for 302 redirects? Well, since 307 is an invalid response code with HTTP/1.0 requests, and crawlers still perform those, there’s no alternative to 302. Is that so? Not really, at least not when you’re dealing with overcautious search engine crawlers. Most HTTP/1.0 requests from search engines are faked, that means the crawler understands everything HTTP/1.1 but sends an HTTP/1.0 request header just in case the server runs since the Internet’s stone age without any upgrades. Yahoo’s Slurp for example does faked HTTP/1.0 requests in general, whilst you can trust Ms. Googlebot’s request headers. If Google’s crawler does an HTTP/1.0 request, that’s either testing the capabilities of a newly discovered server, or something went awfully wrong, usually on your side.

Google’s as well as Yahoo’s crawlers understand both the 302 and the 307 redirect (there’s no official statement from Yahoo though). But there are other Web robots out there (like link checkers of directories, or similar bots sent out by site owners to automatically remove invalid as well as redirecting links), some of them consisting of legacy code. Not to speak of ancient browsers in combination with Web servers which don’t add the hyperlink piece to 307 responses. So if you want to do everything the right way, you send 302 responses to HTTP/1.0 requestors –except when the user agent and the IP address identify a major search engine’s crawler–, and 307 responses to everything else –except when the HTTP/1.1 user agent lacks understanding of 307 response codes–. Ok, ok, ok … you’ll stick with the outdated 302 thingy. At least you won’t change old code just to make it more complex than necessary. With newish applications, which rely on state of the art technologies like AJAX anyway, you can quite safely assume that the user agents understand the 307 response, hence go for it and bury the wrecked 302, but submit only non-redirecting URLs to other places.
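Reduced to a sketch, the decision logic from the last paragraph looks like this (the is_major_crawler flag is an assumption; a real implementation would derive it from the user agent plus a verified IP address):

```python
def temp_redirect_status(http_version, is_major_crawler=False):
    """Pick 302 or 307 for a temporary redirect.

    HTTP/1.0 clients may not understand 307, so they get 302, unless the
    requestor is a verified major crawler faking HTTP/1.0 (those are known
    to handle 307).  Genuine HTTP/1.1 clients get the proper 307.
    """
    if http_version == "HTTP/1.0" and not is_major_crawler:
        return 302
    return 307
```

That is the "do everything the right way" variant; as said above, sticking with a plain 302 for legacy code is a defensible shortcut.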

Here is how Google handles 302 redirects:

[Source …] you shouldn’t use it to tell the Googlebot that a page or site has moved because Googlebot will continue to crawl and index the original location.

Well, that’s not much info, and obviously a false statement. Actually, Google continues to crawl the redirecting URL, then indexes the source URL with the target’s content from redirects within a domain or subdomain only –but not always–, and mostly indexes the target URL and its content when a 302 redirect leaves the domain of the redirecting URL –if not any other URL redirecting to the same location or serving the same content looks prettier–. In most cases Google indexes the content served by the target URL, but in some cases all URL candidates involved in a redirect lose this game in favor of another URL Google has discovered on the target server (usually a short and pithy URL).

Like with 301 redirects, Yahoo “breaks the rules” with 302 redirects too:

[Source …] When one web page redirects to another web page, Yahoo! Web Search sometimes indexes the page content under the URL of the entry or “source” page, and sometimes index it under the URL of the final, destination, or “target” page. […]

When a page in one domain redirects to a page in another domain, Yahoo! records the “target” URL. […]

When a page in a domain presents a temporary redirect to another page in the same domain, Yahoo! indexes the “source” URL.

Yahoo! Web Search indexes URLs that redirect according to the general guidelines outlined above with the exception of special cases that might be read and indexed differently. […]

One of these cases where Yahoo handles redirects “differently” (meaning according to the HTTP standards) is a soft redirect from the root index page to a deep page. Like with a 301 redirect, Yahoo indexes the home page URL with the contents served by the redirect’s target.

You see that there are not that many advantages to 302 redirects pointing to other servers. Those redirects are most likely understood as somewhat permanent redirects, which means that the engines most probably crawl the redirecting URLs at a lower frequency than 307 redirects.

If you have URLs which change their contents quite frequently by redirecting to different resources (from the same domain or on another server), and you want search engines to index and rank those timely contents, then consider the hassles of IP/UA based response codes depending on the protocol version. Also, feed those URLs with as much links as you can, and list them in an XML sitemap with a high priority value, a last modified timestamp like request timestamp minus a few seconds, and an “always”, “hourly” or “daily” change frequency tag. Do that even when you for whatever reasons have no XML-sitemap at all. There’s no better procedure to pass such special instructions to crawlers, even an XML sitemap listing only the ever changing URLs should do the trick.
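A minimal generator for such a sitemap might look like the sketch below (the URL is made up; "hourly" and the 1.0 priority are simply the values suggested above, pick "always" or "daily" as appropriate):

```python
import time

def sitemap_for_volatile_urls(urls):
    """Build a minimal XML sitemap for URLs whose redirect targets change
    frequently: top priority, fresh lastmod, hourly change frequency."""
    lastmod = time.strftime("%Y-%m-%dT%H:%M:%S+00:00", time.gmtime())
    entries = []
    for url in urls:
        entries.append(
            "<url><loc>%s</loc><lastmod>%s</lastmod>"
            "<changefreq>hourly</changefreq><priority>1.0</priority></url>"
            % (url, lastmod))
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            + "".join(entries) + "</urlset>")
```

Regenerate the file on a schedule so the lastmod stays a few seconds behind the crawlers’ requests, as described above.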

If you promote your top level page but pull the contents from deep pages or scripts, then a 302 meant as 307 from the root to the output device is a common way to avoid duplicate content issues while serving contents depending on other request signals than the URI alone (cookies, geo targeting, referrer analysis, …). However, that’s a case where you can avoid the redirect. Duplicating one deep page’s content on root level is a non-issue, a superfluous redirect is an issue with regard to performance at least, and it sometimes slows down crawling and indexing. When you output different contents depending on user specific parameters, treating crawlers as users is easy to accomplish. I’d just make the root index default document a script outputting the former redirect’s target. That’s a simple solution without redirecting anyone (which sometimes directly feeds the top level URL with PageRank from user links to their individual “home pages”).
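The "include instead of redirect" idea can be sketched like so (all names are hypothetical; in a real PHP setup the root index document would simply include or require the deep script):

```python
# The root URL renders the deep page's content directly rather than
# 302'ing the requestor to it.

def render_page(page_id):
    """Stand-in for the real template or database driven output."""
    return "<html><body>Content of %s</body></html>" % page_id

def handle_request(path):
    """Serve the former redirect target's content at the root URL."""
    if path == "/":
        return 200, render_page("home")  # no redirect, no extra hop
    return 200, render_page(path.strip("/"))
```

Nobody gets redirected, and links pointing to the root URL feed it with PageRank directly.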

307 - Temporary Redirect

Well, since the 307 redirect is the 302’s official successor, I’ve told you nearly everything about it in the 302 section. Here is the HTTP/1.1 definition:

307 Temporary Redirect

The requested resource resides temporarily under a different URI. Since the redirection MAY be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

The temporary URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s), since many pre-HTTP/1.1 user agents do not understand the 307 status. Therefore, the note SHOULD contain the information necessary for a user to repeat the original request on the new URI.

The 307 redirect was introduced with HTTP/1.1, hence some user agents doing HTTP/1.0 requests do not understand it. Some! Actually, many user agents fake the protocol version in order to avoid conflicts with older Web servers. Search engines like Yahoo for example perform faked HTTP/1.0 requests in general, although their crawlers do talk HTTP/1.1. If you make use of the feedburner plugin to redirect your WordPress feeds to, respectively resolving to, you’ll notice that Yahoo bots do follow 307 redirects, although Yahoo’s official documentation does not even mention the 307 response code.

Google states how they handle 307 redirects as follows:

[Source …] The server is currently responding to the request with a page from a different location, but the requestor should continue to use the original location for future requests. This code is similar to a 301 in that for a GET or HEAD request, it automatically forwards the requestor to a different location, but you shouldn’t use it to tell the Googlebot that a page or site has moved because Googlebot will continue to crawl and index the original location.

Well, a summary of the HTTP standard plus a quote from the 302 page is not exactly considered a comprehensive help topic. However, checked with the feedburner example, Google understands 307s as well.

A 307 should be used when a particular URL for whatever reason must point to an external resource. When you for example burn your feeds, redirecting your blog software’s feed URLs with a 307 response code to “your” feed at or another service is the way to go. In this case it plays no role that many HTTP/1.0 user agents don’t know shit about the 307 response code, because all software dealing with RSS feeds can understand and handle HTTP/1.1 response codes, or at least can interpret the class 3xx and request the feed from the URI provided in the header’s location field. More importantly, because with a 307 redirect each revisit has to start at the redirecting URL to fetch the destination URI, you can move your burned feed to another service, or serve it yourself, whenever you choose to do so, without dealing with long-term cache issues.

302 temporary redirects might result in cached addresses from the location’s URL due to an imprecise specification in the HTTP/1.0 protocol, but that shouldn’t happen with HTTP/1.1 response codes, which, in the 3xx class, all clearly tell what’s cacheable and what’s not.

When your site’s logs show only a tiny amount of actual HTTP/1.0 requests (eliminate crawlers of major search engines from this report), you really should do 307 redirects instead of wrecked 302s. Of course, avoiding redirects where possible is always the better choice, and don’t apply 307 redirects to moved URLs.


Here are the bold sentences again. Hop to the sections via the table of contents.

  • Avoid redirects where you can. URLs, especially linked URLs, are assets. Often you can include other contents instead of performing a redirect to another resource. Also, there are hyperlinks.
  • Search engines process HTTP redirects (301, 302 and 307) as well as meta refreshes. If you can, always go for the cleaner server sided redirect.
  • Always redirect to the final destination to avoid useless hops which kill your search engine traffic. With each and every revamp that comes with URL changes check for incoming redirects and make sure that you eliminate unnecessary hops.
  • You must maintain your redirects forever, and you must not remove (permanent) redirects. Document all redirects, especially when you do redirects both in the server configuration as well as in scripts.
  • Check your logs for redirects done by the Web server itself and unusual 404 errors. Vicious Web services like Yahoo or MSN screw your URLs to get you in duplicate content troubles with Google.
  • Don’t track links with redirecting scripts. Avoid redirect scripts in favor of link attributes. Don’t hoard PageRank by routing outgoing links via an uncrawlable redirect script, don’t buy too much of the search engine FUD, and don’t implement crappy advice from Webmaster hangouts.
  • Clever redirects are your friend when you handle incoming and outgoing affiliate links. Smart IP/UA based URL cloaking with permanent redirects makes you independent from search engine canonicalization routines which can fail, and improves your overall search engine visibility.
  • Do not output anything before an HTTP redirect, and terminate the script after the last header statement.
  • For each server sided redirect, send an HTTP status line with a well-chosen response code, and an absolute (fully qualified) URL in the location field. Consider tagging the redirecting script in the header (X-Redirect-Src).
  • Put any redirect logic at the very top of your scripts. Encapsulate redirect routines. Performance is not everything, transparency is important when the shit hits the fan.
  • Test all your redirects with server header checkers for the right response code and a working location. If you forget an HTTP status line, you get a 302 redirect regardless of your intention.
  • With canonicalization redirects, use “not equal” conditions to cover everything. Most .htaccess code posted on Webmaster boards, supposed to fix for example www vs. non-www issues, is unusable. If you reply “thanks” to such a post with your URL in the signature, you invite saboteurs to make use of the exploits.
  • Use only 301 redirects to handle permanently moved URLs and canonicalization. Use 301 redirects only for persistent decisions. In other words, don’t blindly 301 everything.
  • Don’t redirect too many URLs simultaneously; move large amounts of pages in smaller chunks.
  • 99% of all 302 redirects are either syntax errors or semantically crap, but there are still some use cases for search engine friendly 302 redirects. “Moved URLs” is not on that list.
  • The 307 redirect can replace most wrecked 302 redirects, at least in current environments.
  • Search engines do not handle redirects according to the HTTP specs any more. At least not when a redirect points to an external resource.
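A couple of these rules (always redirect to the final destination, eliminate unnecessary hops after each revamp) boil down to flattening your redirect table. A sketch, assuming your redirects live in a simple source-to-target mapping:

```python
def flatten_redirects(redirect_map):
    """Given a {source: target} table of 301s, point every source straight
    at its final destination so no request ever needs more than one hop."""
    def final_target(url):
        seen = set()  # guard against redirect loops
        while url in redirect_map and url not in seen:
            seen.add(url)
            url = redirect_map[url]
        return url
    return {src: final_target(src) for src in redirect_map}
```

Run something like this against your rewrite rules after every restructuring, and regenerate the server configuration from the flattened table.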

I’ve asked Google in their popular picks campaign for a comprehensive write-up on redirects (which is part of the ongoing help system revamp anyway, but I’m either greedy or not patient enough). If my question gets picked, I’ll update this post.

Did I forget anything else? If so, please submit a comment. ;)


Shit happens, your redirects hit the fan!

Although robust search engine crawlers are rather fault-tolerant creatures, there is an often overlooked but quite reliable procedure to piss off the spiders: playing redirect ping pong mostly results in unindexed contents. Google reports chained redirects under the initially requested URL as URLs not followed due to redirect errors, and recommends:

Minimize the number of redirects needed to follow a link from one page to another.

The same goes for other search engines, they can’t handle longish chains of redirecting URLs. In other words: all search engines consider URLs involved in longish redirect chains unreliable, not trustworthy, low quality …

What’s that to you? Well, you might play redirect ping pong with search engine crawlers unknowingly. If you’ve ever redesigned a site, chances are you’ve built chained redirects. In most cases those chains aren’t too complex, but it’s worth checking. Bear in mind that Apache, .htaccess, scripts or CMS software and whatnot can perform redirects, often without notice and undetectable with a browser.
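If you can export your redirect rules into a source-to-target table, a few lines suffice to spot such chains (a simulated walk over made-up example.com URLs; auditing a live site would mean issuing HEAD requests and following Location headers instead):

```python
def count_hops(url, redirect_map, limit=10):
    """Follow a (simulated) redirect table and count the hops until a URL
    finally serves content, giving up after `limit` hops."""
    hops = 0
    while url in redirect_map and hops < limit:
        url = redirect_map[url]
        hops += 1
    return hops, url
```

Anything reporting more than one hop is a candidate for collapsing into a single direct redirect.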

I made up this example, but I’ve seen worse redirect chains. Here is the transcript of Ms. Googlebot’s chat with your Web server:
[Image: crappy redirect chain]

Googlebot: Now that’s a nice link I’ve discovered on this old and trusted page. I can’t wait to fetch it. Hey port 80 at, would you please be so kind to serve me /some-page?

.htaccess: Oh silly Googlebot, don’t you read Matt’s blog? He told me that a 301 redirect is the canonical answer when someone requests my stuff without the www-prefix. I didn’t bother to lookup the resource you’ve asked for, and why should I since your request is wrong, wrong, wrong! Here is the canonical URL: 301-Moved permanently, Location:

Googlebot: Dear Web server, of course that’s the best thing to do. I apologize for my silly request, but please note that I’m following a link where the Webmaster of the authority site forgot the dup-dup-dup stuff in the HREF attribute’s value. Can you please serve me /some-page now?

Apache: Doh! Why does this silly bot bother me with an incomplete URI? Since /some-page is not an existing file, probably it’s a directory and a brain dead Webmaster forgot the trailing slash. Let’s see … yep, that’s it, and it even has a default document index.php. Dear Ms. Googlebot, here is the correct URL: 302-Found elsewhere, Location:

[Most servers will respond with 301-Moved permanently, Location:, but however the server is configured, it will be a redirect.]

Googlebot: Oh I see, I’ve got a really malformed URI. Sorry for bothering you again, could you please deliver the content of Thanks in advance!

/some-page/index.php: Uuaahhh … who’s knocking? I slept sooo long … not fully awake … please be patient while Apache hands me over to the PHP parser. Actually, I’m totally outdated because my Webmaster discovered some blogging software and moved my contents to a WordPress database. Please wait. Where is the darn post-it with the new address? Ahh, just found it under my pillow. Please don’t disturb me again, bye bye, and here you go: 301-Moved permanently, Location: http://www.example.com/blog/some-post/

Googlebot: You’re kidding! Why didn’t you give me this URI in the first place? Never mind, I want http://www.example.com/blog/some-post/ now.

Hack in /blog/index.php: Oh my god, another bot following outdated links. My Webmaster is clever, he read somewhere that category names in URLs boost search engine rankings and changed the permalink structure to category/postname. Wait a moment, I’m looking up the category for this post, be right back. Hmmm, this category name is not exactly a money term … the heck with it! Here is the keyword stuffed URL I just made for you: 301-Moved permanently, Location: http://www.example.com/blog/cat/some-post/

Googlebot: Four crappy redirects in a row, what a bunch of bull! This is my ultimately last try: gimme http://www.example.com/blog/cat/some-post/!

WordPress: See, that didn’t hurt, did it? Here you are: 200-OK, the content

Googlebot: Here’s your shabby page. Don’t send me to these assclowns again! I’ve better things to do than running five times to this misconfigured server only to fetch a keyword laden sales pitch with 34 grammar errors, 67 typos, and a ton of affiliate links. Grrrr!

Crawl scheduler: Sorry dear. I can’t blacklist them for stuff like that, but I’ve set the crawl frequency for this URL to once a year, and I’ve red-flagged the document ID so that the indexer can downrank it accordingly.

Do you really want to treat Ms. Googlebot so badly? Not to speak of the minus points you gain for playing redirect ping pong with a search engine. Maybe most search engines index a page served after four redirects, but I won’t rely on such a redirect chain. It’s quite easy to shorten it. Just delete outdated stuff so that all requests run into a 404-Not found, then write up a list in a format like

Old URI 1 Delimiter New URI 1 \n
Old URI 2 Delimiter New URI 2 \n
  … Delimiter   … \n

and write a simple redirect script which reads this file and performs a 301 redirect to New URI when REQUEST_URI == Old URI. If REQUEST_URI doesn’t match any entry, then send a 404 header and include your actual error page. If you need to change the final URLs later on, you can easily do that in the text file’s right column with search and replace.
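The lookup itself is trivial. Here’s a sketch in JavaScript, assuming a tab as the delimiter; the surrounding script still has to read the flat file and emit the 301 or 404 headers:

```javascript
// Sketch of the redirect lookup, assuming tab-delimited "old<TAB>new" lines.
// Returns the new URI for a 301 redirect, or null when the request should
// fall through to the 404 error page.
function resolveRedirect(mapText, requestUri) {
  var lines = mapText.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var parts = lines[i].split('\t');
    if (parts.length === 2 && parts[0] === requestUri) {
      return parts[1]; // send "301 Moved Permanently" + "Location: " + this
    }
  }
  return null; // no match: send the 404 header and include the error page
}
```

Changing the final URLs later on then really is just search and replace in the right column.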

Next, point the ErrorDocument 404 directive in your root’s .htaccess file to this script. Done. Not counting possible www/non-www canonicalization redirects, you’ve cut the number of redirects down to one, regardless of how often you’ve moved your pages. Don’t forget to add all outdated URLs to the list when you redesign your stuff again, and cover common 3rd party sins like truncated trailing slashes too. The flat file for the example above would look like:

/some-page Delimiter /blog/cat/some-post/ \n
/some-page/ Delimiter /blog/cat/some-post/ \n
/some-page/index.php Delimiter /blog/cat/some-post/ \n
/blog/some-post Delimiter /blog/cat/some-post/ \n
/blog/some-post/ Delimiter /blog/cat/some-post/ \n
  … Delimiter   … \n

With a large site, consider a database table; processing huge flat files on every 404 error can come with disadvantages. Also, if you’ve got patterns like /blog/post-name/ ==> /blog/cat/post-name/, then don’t generate and process longish mapping tables, but cover these redirects algorithmically.
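Such an algorithmic rule can be sketched like this (lookupCategory() is a made-up stand-in for the real post-to-category lookup):

```javascript
// Sketch: cover a whole URL pattern algorithmically instead of listing
// every single URL in the mapping file. Maps /blog/post-name/ to
// /blog/cat/post-name/; lookupCategory() stands in for your real lookup.
function patternRedirect(requestUri, lookupCategory) {
  var m = requestUri.match(/^\/blog\/([^\/]+)\/$/); // one path segment only
  if (!m) return null; // doesn't fit the old URL pattern, no redirect
  return '/blog/' + lookupCategory(m[1]) + '/' + m[1] + '/';
}
```

URLs that already carry a category segment don’t match the pattern, so they fall through untouched.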

To gather URLs worth a 301 redirect use these sources:

  • Your server logs.
  • 404/301/302/… reports from your server stats.
  • Google’s Web crawl error reports.
  • Tools like XENU’s Link Sleuth which crawl your site and output broken links as well as all sorts of redirects, and can even check your complete Web space for orphans.
  • Sitemaps of outdated structures/site areas.
  • Server header checkers which follow all redirects to the final destination.

Disclaimer: If you suffer from IIS/ASP, free hosts, restrictive hosts like Yahoo or other serious maladies, this post is not for you.

I’m curious, did your site play redirect ping pong with search engine crawlers?


How to fuck up click tracking with the JavaScript onclick trigger

There’s a somewhat heated debate over at Sphinn and many other places as well, where folks call each other guppy and dumbass while trying to figure out whether a particular directory’s click tracking sinks PageRank distribution or not. Besides interesting replies from Matt Cutts, an essential result of this debate is that Sphinn will implement a dumbass button.

Usually I wouldn’t write about desperate PageRank junkies going cold turkey, not even as a TGIF post, but the reason why this blog directory most probably doesn’t pass PageRank is interesting, because it has nothing to do with onclick myths. Of course the existence of an intrinsic event handler (aka onclick trigger) in an A element alone has nothing to do with Google’s take on the link’s intention, hence an onclick event itself doesn’t pull a link’s ability to pass Google-juice.

To fuck up your click tracking you really need to forget everything you’ve ever read in Google’s Webmaster Guidelines. Unfortunately, Web developers usually don’t bother reading dull stuff like that, and code the desired functionality in a way that makes Google as well as other search engines puke on the generated code. However, ignorance is no excuse when Google talks best practices.

Let’s look at the code. Code reveals everything, and not every piece of code is poetry. That’s crap:
.html: <a href="http://sebastians-pamphlets.com/"
onclick="return o('sebastians-blog');">Sebastian's Pamphlets</a>

.js: function o(lnk){ window.open('/out/'+lnk+'.html'); return false; }

The script /out/sebastians-blog.html counts the click and then performs a redirect to the HREF’s value.

Why can and most probably will Google consider the hapless code above deceptive? A human visitor using a JavaScript enabled user agent clicking the link will land exactly where expected. The same goes for humans using a browser that doesn’t understand JS, and users surfing with JS turned off. A search engine crawler ignoring JS code will follow the HREF’s value pointing to the same location. All final destinations are equal. Nothing wrong with that. Really?

Nope. The problem is that Google’s spam filters can analyze client sided scripting, but don’t execute JavaScript. Google’s algos don’t ignore JavaScript code, they parse it to figure out the intent of links (and other stuff as well). So what does the algo do, see, and how does it judge eventually?

It understands the URL in HREF as the definitive and ultimate destination. Then it reads the onclick trigger and fetches the external JS file to look up the o() function. It will notice that the function returns an unconditional FALSE. The algo knows that this return value prevents JS-enabled user agents from loading the URL provided in HREF. Even if o() did nothing else, a human visitor with a JS-enabled browser will not land at the HREF’s URL when clicking the link. Not good.

Next, the window.open() statement loads the redirect script /out/sebastians-blog.html, not the URL from the HREF’s value (truncating the trailing slash is a BS practice as well, but that’s not the issue here). The URLs put in HREF and built in the JS code aren’t identical. That’s a full stop for the algo. Probably it does not request the redirect script to analyze its header, which sends a Location: line. (Actually, this request would tell Google that there’s no deceitful intent, just plain hapless and overcomplicated coding, which might result in a judgement like “unreliable construct, ignore this link” or so, depending on other signals available.)

From the algo’s perspective the JavaScript code performs a more or less sneaky redirect. It flags the link as shady and moves on. Guess what happens in Google’s indexing process with pages that carry tons of shady links … those links not passing PageRank sounds like a secondary problem. Perhaps Google is smart enough not to penalize legit sites for, well, hapless coding, but that’s sheer speculation.

However, shit happens, so every once in a while such a link will slip thru and may even appear in reverse citation results like link: searches or Google Webmaster Central link reports. That’s enough to fool even experts like Andy Beard (maybe Google even shows bogus link data to mislead SEO researchers of any kind? Never mind).

Ok, now that we know how not to implement onclick click tracking, here’s an example of a bullet-proof method to track user clicks with the onclick event:
<a href="http://sebastians-pamphlets.com/"
onclick="return trackclick(this.href);">
Sebastian's Pamphlets</a>
trackclick() is a function that calls a server sided script to store the click and returns TRUE, without doing a redirect or opening a new window.
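A minimal sketch of such a function; the /track endpoint is made up, and the send() callback stands in for the actual transport (e.g. an Image request or XHR), so the crucial parts stay visible: no redirect, no new window, and an unconditional TRUE:

```javascript
// Sketch of a click tracker that doesn't sabotage the link. The send()
// callback stands in for the real transport (e.g. new Image().src = url);
// the /track endpoint is hypothetical. Crucially, there's no redirect and
// the function returns TRUE, so the browser follows the HREF as usual.
function trackclick(href, send) {
  send('/track?url=' + encodeURIComponent(href)); // log the click server-side
  return true; // never return false here
}
```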

Here is more information on search engine friendly click tracking using the onclick event. The article is from 2005, but not outdated. Of course you can add onclick triggers to all links with a few lines of JS code. That’s good practice because it avoids clutter in the A elements and makes sure that every (external) link is trackable. For this more elegant way to track clicks the warnings above apply too: don’t return false and don’t manipulate the HREF’s URL.
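That site-wide variant boils down to looping over document.links and assigning the handler to external links only. The helper below does the external check (comparing hostnames is an assumption about what “external” should mean here):

```javascript
// Helper for the site-wide variant: pick out external links so a loop over
// document.links can assign the onclick tracker only where it's wanted.
function isExternalLink(href, ownHost) {
  var m = href.match(/^https?:\/\/([^\/?#]+)/i); // absolute URLs only
  return !!m && m[1].toLowerCase() !== ownHost.toLowerCase();
}
// In the browser (not runnable here) something like:
// for (var i = 0; i < document.links.length; i++) { assign the handler }
```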


Cat post: Life’s getting better!

Since I’ve moved this blog here my traffic has nicely improved. Skip the bragging, hop into the grub: during the past three weeks my feed subscriptions went up by 17%, and daily uniques by 460% (I never got stumbled at blogspot).

Thanks to Google’s newish BlitzIndexing my SERP referrers went up from 1 (August/24/2007) to 40 yesterday, naturally all long tail searches by the way. I didn’t count the bogus MSN referrer spam suggesting I rank for “yahoo”, “pontiac” and whatnot.

The search engine crawlers are quite busy too. When Ms. Googlebot was done fetching all URLs 3-5 times each, Slurp and MSNbot wholeheartedly joined the game. Google has indexed roughly 500 pages, my Webmaster Central account counts 2,000 inbound links, shows PageRank, anchor text stats and all the other neat stuff. Yahoo has indexed 90 pages and 1,000 inbound links. Although MSN has crawled a lot, they’ve indexed a whopping 2 pages. Wow.

I’d be quite happy since my blog’s life is getting better, if there weren’t that info from Google via Matt Cutts’s blog telling me that 80% of my pages are considered useless crap, at least from Google’s perspective. That’s not a joke. Google dislikes 80% of this wonderful blog, although it contains only 20% Google bashing. Weird.

I repeated the searches below multiple times, so what I’ve spotted is not an isolated phenomenon, nor coincidence. Here’s what the standard site search query shows, 494 indexed pages:
Google site search for Sebastian's Pamphlets

Next I’ve used Matt’s power toy limiting the results to pages Google discovered within the past 30 days (&as_qdr=d30). Please note that 30 days ago this domain didn’t exist. I’ve installed WordPress on August/16/2007, the day I’ve registered the domain, that means 29 days ago I’ve created the very first indexable page. The rather astonishing result is 89 indexed pages:
Google site search for Sebastian's Pamphlets for the past 30 days

Either Matt’s time tunnel for power searchers is only 20% accurate, or 80% of my stuff went straight into the supplemental index from where advanced search can’t pull it.

The latter presumption is plausible, because the site is new, 99% or so of my deep links came in 2-3 weeks ago via 301 redirects so that the pages have no PageRank yet, and for most of the URLs Google noticed near duplicates with source bonus from my old blogspot outlet, not to speak of scraped stuff on low-life servers. Only roughly 90 pages can have gained noticeable PageRank yet, judging from my understanding of my fresh inbound links and my internal linkage. Interestingly, those 90 pages on my blog have a real world timestamp after the funeral of the blogspot thingy, and content that didn’t exist over there. That could lead to interesting theories, however I guess that indeed <speculation>time restricted searches don’t pull pages from the supplemental hell</speculation>. Reminds me of the fact that Google’s link stats show nofollow’ed links and all that, but not a single link from a page buried in the supp index.

Did Matt by accident reveal a sure-fire procedure to identify supplemental results? I mean they can’t make timely searches defunct like /& and undocumented stuff like that. I’ve tested the method with two sites where I know the supp ratio and the results were kinda plausible, but that’s no proof. Of course I couldn’t resist posting this vague piece of speculation before doing solid research. Maybe I’m dead wrong.

What do you guys think? Flame me in the comments. :)


Google says you must manage your affiliate links in order to get indexed

I’ve worked hard to overtake the SERP positions of a couple of merchants allowing me to link to them with an affiliate ID, and now the almighty Google tells the sponsors they must screw me with internal 301 redirects to rescue their rankings. Bugger. Since I read the shocking news on Google’s official Webmaster blog this morning I worked on a counter strategy, with success. Affiliate programs will not screw me, not even with Google’s help. They’ll be hoist by their own petard. I’ll strike back with nofollow and I’ll take no prisoners.

Seriously, the story reads a little different and is not breaking news at all. Maile Ohye from Google just endorsed best practices I’ve recommended for ages. Here is my recap.

The problem

Actually, there are problems on both sides of an affiliate link. The affiliate needs to hide these links from Google to avoid a so called “thin affiliate site penalty”, and the affiliate program suffers from duplicate content issues, link juice dilution, and often even URL hijacking by affiliate links.

Diligent affiliates gathering tons of PageRank on their pages can “unintentionally” overtake URLs on the SERPs by fooling the canonicalization algos. When Google discovers lots of links from strong pages on different hosts pointing to the merchant’s landing page with ?affid=me appended, and this page adds ?affid=me to its internal links, my affiliate URL on the sponsor’s site can “outrank” the official home page or landing page. When I choose the right anchor text, Google will feed my affiliate page with free traffic, whilst the affiliate program’s very own pages don’t exist on the SERPs.

Managing incoming affiliate links (merchants)

The best procedure is capturing all incoming traffic before a single byte of content is sent to the user agent, extracting the affiliate ID from the URL, storing it in a cookie, then 301-redirecting the user agent to the canonical version of the landing page, that is a page without affiliate or user specific parameters in the URL. That goes for all user agents (humans accepting the cookie and Web robots which don’t accept cookies and start a new session with every request).
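The capture step can be sketched as a pure function; the parameter name affid and the merchant URL in the test are made up:

```javascript
// Sketch of the merchant-side capture: pull the affiliate ID out of the
// query string (so it can go into the cookie), and strip it from the URL
// that goes into the 301's Location header. "affid" is a made-up name.
function canonicalize(url) {
  var m = url.match(/^([^?]*)\??(.*)$/);       // split path and query string
  var keep = [], affid = null;
  var pairs = m[2] ? m[2].split('&') : [];
  for (var i = 0; i < pairs.length; i++) {
    var kv = pairs[i].split('=');
    if (kv[0] === 'affid') affid = kv[1];      // goes into the cookie
    else keep.push(pairs[i]);                  // non-tracking params survive
  }
  return { affid: affid,
           location: m[1] + (keep.length ? '?' + keep.join('&') : '') };
}
```

The real script would set the cookie from affid and send the 301 with the returned location before a single byte of content goes out.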

Users not accepting cookies are redirected to a version of the landing page blocked by robots.txt, the affiliate ID sticks with the URLs in this case. Search engine crawlers, identified by their user agent name or whatever, are treated as users and shall never see (internal) links to URLs with tracking parameters in the query string.

This 301 redirect passes all the link juice, that is PageRank & Co. as well as anchor text, to the canonical URL. Search engines can no longer index page versions owned by affiliates. (This procedure doesn’t protect you from 302 hijacking, where your content gets indexed under the affiliate’s URL.)

Putting safe affiliate links (online marketers)

Honestly, there’s no such thing as a safe affiliate link, at least not safe with regard to picky search engines. Masking complex URLs with redirect services like TinyURL or so doesn’t help, because the crawlers get the real URL from the redirect header and will leave a note in the record of the original link on the page carrying the affiliate link. Anyways, the tiny URL will fool most visitors, and if you own the redirect service it makes managing affiliate links easier.

Of course you can cloak the hell out of your thin affiliate pages by showing the engines links to authority pages whilst humans get the ads, but then better forget the Google traffic (I know, I know … cloaking still works if you can handle it properly, but not everybody can handle the risks so better leave that to the experts).

There’s only one official approach to make a page plastered with affiliate links safe with search engines: replace it with a content rich page, of course Google wants unique and compelling content and checks its uniqueness, then sensibly work in the commercial links. Best link within the content to the merchants, apply rel-nofollow to all affiliate links, and avoid banner farms in the sidebars and above the fold.
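Worked into the copy, such a condomized affiliate link could look like this (merchant URL and affiliate ID are made up):

```html
<p>After a month of testing, the
<a href="http://merchant.example/widget?affid=123" rel="nofollow">Widget 3000</a>
still earns its place on my desk.</p>
```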

Update: I’ve sanitized the title, “Google says you must screw your affiliates in order to get indexed” was not one of my best title baits.

