2007 December

Monthly archive: December, 2007

Vote Now: Rubber Chicken Award 2007 for the dullest and most tedious search blog post

Posted on 28 December, 2007

Rubber Chicken Award - Top 10 Finalists I’m truly excited. Two of my pamphlets made it in The Rubber Chicken Award’s Top 10! That’s 50% success (2/4 nominated pamphlets), so please help me to make that 100%: vote for #3 and #4!

Just in case you, dear reader, are not a hardcore SEM addict who reads search blogs even during the holiday season, let me explain why a Rubber Chicken Award Top 10 nomination is a honor.

The Rubber Chicken Award honors the year’s most serious SEO research. Extra brownie points are given to the dullest draft and the most tedious wording.

Rumors are swirling that Google’s search ~~quality~~ spam task force has developed the complex RCAFHITSI^©™ algo^{patent pending®} which compiles and ranks search blog posts presented to Mike Blumenthals’s Rubber Chicken Award Jury:

Bill Slawski
Greg Sterling
Rand Fishkin
Danny Sullivan
Matt Cutts (suspected final arbiter and extremely busy, so we don’t know whether he’ll show up or not)

Here is the cream of the crop of the search world, the 2007 Top 10 search blog posts nominated in the Rubber Chicken Award for the dullest and most boring/serious SEO/SEM article:

Want traffic? Rank for High Traffic Keywords…
We Add Words to AdWords… Google Subtracts them
Why eBay and Wikipedia rule Google’s SERPs
SEOs home alone - Google’s nightmare
13 Things to Do When Your Loved One is Away at Conferences
SEO High School Confidential - Premiere Edition!
The Sphinn Awards - Part I & -Part II.
Top 21 Signs You Need a Break From SEO (2007 version)
10 Signs That You May Be a Blog Addict
The SEO’s Guide to Beginners
The Internet Marketer’s Nightmare
Mission Accomplished—Top Ranking in Google
Google Interiors - the day my house became searchable

I’ve selfishly marked the two posts you want to vote for. Because all nominations are truly awesome, just vote for everything but make sure to check “5” for #3 and #4:
VOTE NOW

Thank You, Dear Reader!

Update: I can’t post another voting whore call to action today, but of course I’d very much appreciate your vote in the Best SEO Blog of 2007 category at SEJ’s 2007 Search Blog Awards.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

13 comments Sebastian | Internet Marketing, Blogging, Ego Food, Fun

Ping the hell out of Technorati’s reputation algo

Posted on 19 December, 2007

Ping your inbound links for technorati reputation If your Technorati reputation factor sucks ass then read on, otherwise happily skip this post.

Technorati calculates a blog’s authority/reputation based on its link popularity, counting blogroll links from the linking blogs main pages as well as links within the contents of their posts. Links older than six months after their very first discovery don’t count.

Unfortunately, Technorati is not always able to find all your inbound links, usually because clueless bloggers forget to ping them, hence your blog might be undervalued. You can change that.

Compile a list of blogs that link to you and are unknown at Technorati, then introduce them below to a cluster ping orgy. Technorati will increase your authority rating after indexing those blogs.

Enter one blog home page URL per line, all lines properly delimited with a “\n” (new line, just hit [RETURN]; “\r” crap doesn’t work). And make sure that all these blogs have an auto-discovery link pointing to a valid feed in their HEAD section. Do NOT ping Technorati with post-URIs! Invest the time to click through to the blog’s main page and submit the blog-URI instead. Post-URI pings get mistaken for noise and trigger spam traps, that means their links will not increase your Technorati authority/rank.

Results:

<br/>

Actually, this tool pings other services than Technorati too. Pingable contents make it on the SERPs, not only at Technorati.

If you make use of URL canonicalization routines that add a trailing slash to invalid URLs like http://example.com then make sure that you claim your blog at Technorati with the trailing slash.

Please note that this tool is experimental and expects a Web standard friendly browser. It might not work for you, and I’ll remove it if it gets abused.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

37 comments Sebastian | Technorati, Blogging, Tools, SEO

No more RSS feeds in Google’s search results

Posted on 18 December, 2007

Folks try all sorts of naughty things when by accident a blog’s feed outranks the HTML version of a post. Usually that happened mostly to not that popular blogs, or with very old posts and categorized feeds that contain ancient articles.

The problem seems to be that Google’s Web search doesn’t understand the XML structure of feeds, so that a feed’s textual contents get indexed like stuff from text files. Due to “subscribe” buttons and other links, feeds can gather more PageRank than some HTML pages. Interestingly .xml is considered an unknown file type, and advanced search doesn’t provide a way to search within XML files.

Now that has changed¹. Googler Bogdan Stănescu posts on the German Webmaster blog² We remove feeds from our search results:

As Webmasters many of you were probably worried that your RSS or Atom feeds could outrank the accompanying HTML pages in Google’s search results. The emergence of feeds in our search results could be a poor user experience:

1. Feeds increase the probability that the user gets the same search result twice.

2. Users who click on the feed link on a SERP may miss out on valuable content, which is only available on the HTML page referenced in the XML file.

For these reasons, we have removed feeds from our Web search results - with the exception of podcasts (feeds with media files).

[…] We are aware that in addition to the podcasts out there some feeds exist that are not linked with an HTML page, and that is why it is not quite ideal to remove all feeds from the search results. We’re still open for feedback and suggestions for improvements to the handling of feeds. We look forward to your comments and questions in the crawling, indexing and ranking section of our discussion forum for Webmasters. [Translation mine]

I’m not yet sure whether or not that’s ending in a ban of all/most XML documents. I hope they suppress RSS/Atom feeds only, and provide improved ways to search for and within other XML resources.

So what does that mean for blog SEO? Unless Google provides a procedure to prevent feeds from accumulating PageRank whilst allowing access for blog search crawlers that request feeds (I believe something like that is in the works), it’s still a good idea to nofollow all feed links, but there’s absolutely no reason to block them in robots.txt any more.

I think that’s a great move into the right direction, but a preliminary solution, though. The XML structure of feeds isn’t that hard to parse, and there are only so many ways to extract the URL of the HTML page. Then when a relevant feeds lands in a raw result set, Google should display a link to the HTML version on the SERP. What do you think?

¹ Danny reminded me that according to Matt Cutts that’s going on for a few months now.

² 24 hours later Google published the announcement in English language too.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

11 comments Sebastian | Blogging, Search Quality, SEO, Google

Nominate a red crab in the 2007 Search Blog Awards!

Posted on 17 December, 2007

Nominate the red SEO crab Today Loren asked for selfish nominations, thus everybody posts a call for action.
So did I:

• Best search related pamphlets
I hereby selfishly submit my blog.

To no avail:

Sebastian, we’re not going to have a category for Best Pamphlets, but good try

There’s no such thing as a Best Crabby Search Pamphlets category just because my blog would be the sole candidate? Ok, I understand that. Really. I didn’t even swear. Yet.

So here’s my call for action. Nominate your favorite blog (that’s mine of course!) in any of the following categories that match:

Best SEO Blog
You’d expect more marketing stuff from an SEO blog.
Best SEM Blog
You’d expect even more marketing stuff, as well as PPC and whatnot. I suck on both.
Best SEO Plugin for Wordpress
I never wrote a WordPress plugin. Actually, this year I hate WordPress because they messed up the database structure in version 2.3 without providing any documentation or at least a reasonable migration procedure. Also their coding standards suck ass and make me puke whenever I see WordPress code.
Best Search Agency Resource Blog
My employers don’t blog.
Best Link Building Blog
Link building pamphlets are rare nowadays.
Best Social Media Marketing or Optimization Blog
I don’t game social media.
Best Local Search Blog
I’m happy when I find my shoes before I leave the house, hence I can’t give any advice on local search.
Best Video Search Blog
I watch x-rated videos only. Probably posting geeky clips doesn’t qualify me.
Best Mobile Search Blog
When I’m on the road I usually search until I give up and ask a cabby for an escort. Cheating this way makes sure I’m not always too late, but doesn’t qualify me for mobile search consultancy.
Best Google Blog Not Owned by Google
I’m not in Google news.
Best Search Engine Corporate Blog (owned by the search engines)
Although I’ve developed a tiny search engine years ago, I fear that smutty results don’t count.
Best Contextual Advertising Blog
My organic traffic is cheaper, and probably as reliable as PPC campaigns.
Best Affiliate Marketing Blog
I sold two Seobook subscriptions recently, does that count?
Best Search Engine Community/Forum
I visit Sphinn and the Google Webmaster forum and never will launch a new forum again.
Best New Search Engine of 2007
See above.
Best Search Engine Research Blog
I revealed that Microsoft plans to relaunch Live Search as porn affiliate program, why eBay and Wikipedia rule Google’s SERPs, and more SEO research like that.
Best Search Linkbait of 2007
When I try it, folks bury it.
Breakout Blog of 2007
I’m blogging since 2005 but moved my blog away from blogspot this year.
Best Search Conference Coverage of 2007
I don’t even attend conferences.
Best Search Conference Coverage in Photos
See above.
Best Search Marketing Facebook Group
Facebook killed my account for spamming or so.
Most Giving Search Blogger
I can’t give away a fraction of Bill Slawski’s great insights.
Best Independent Search Blog (not owned by media company or marketing agency)
What does that mean? Ok, I’m in.
Best Search Blog Post of 2007
I wrote a dull book on redirects, and more.

Oh well. Instead of nominating my stuff better convince Search Engine Journal that they really need a Crabby Pamphlets Category. Or try Category #16 at Performancing.

Update December/28/2007: YAY! Thank you all! Now you can vote for my pamphlets in the “Best SEO Blog of 2007″ category at the SEJ Search Blog Award 2007 contest. Here are the candidates:

SEOmoz Blog

Sebastian’s Pamphlets

Search Engine Roundtable

Graywolf’s Wolf-Howl

Tropical SEO

PageTrafficBlog

SugarRae

SEO Scoop

Search Rank Blog

SEO by the SEA

Search Marketing Gurus

SEO Book

It truly is an honor just to be nominated together with these great SEO bloggers.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

3 comments Sebastian | Blogging, Ego Food, Fun

BlogCatalog needs professional help

Posted on 13 December, 2007

BlogCatalog Devil A while ago I helped BlogCatalog to fix an issue with their JavaScript click tracking that Google considered somewhat crappy. The friendly BlogCatalog guys said thanks, and since then joining BC was on my ToDo-list because it seemed to be a decent service.

Recently I missed my cute red crab icon in a blog’s sidebar widget, realized that it’s powered by BlogCatalog and not MyBlogLog, so I finally signed up.

Roughly 24 hours later I was quite astonished as I received this email:

BlogCatalog - Submission Declined: Sebastian´s Pamphlets

Dear Sebastian,

Thank you for submitting your blog Sebastian`s Pamphlets (http://sebastians-pamphlets.com/) to BlogCatalog.

Unfortunately upon reviewing your blog we are unable to grant it access to the directory.

Your blog was declined for the following reason:

* You did not add a link back to Blog Catalog from your website.
To add a link visit: http://www.blogcatalog.com/buttons.php

If you believe this to be a mistake, you can login to Blog Catalog ( http://www.blogcatalog.com/blogs/manage_blog.html ) and change anything which may have caused it to get declined. After updating your blog, it will be put back into the submission queue.

If you have any questions/comments/suggestions/ideas please feel free to contact us.

Thanks,
BlogCatalog

Crap on, I followed the instructions on http://www.blogcatalog.com/buttons.php:

Meta Tag Verification

Id you’d rather not add a link back to BlogCatalog you can alternatively copy the meta tag listed below and paste it in your site’s home page in the first <head> section of the page, before the first <body> section.

<meta name=”blogcatalog” content=”9BC8674180″ />

It’s laughable to talk about the “first HEAD section” because an HTML file can have only one. Also having more than one BODY section is certainly not compliant to any standard. But bullshit aside, they clearly state that they’re fine with a meta tag if a blogger refuses to add a reciprocal link or even a pile of server sided code that slows down each and every page.

If I remember correctly, BC folks accused of hoarding PageRank defended their policy with statements like

I should quickly clear up that we provide also widgets and meta tags to verify ownership for anyone who doesn’t want to link back to us. We understand PageRank is sacred to many of our bloggers and give them the options to preserve their PR. [emphasis mine, also I’ve removed typos]

Not that I care much about PageRank leaks, but I never link to directories. And why should I when they can verify my submission in other ways?

Obviously, BlogCatalog staff can’t be bothered to view my home page’s source code, and they’ve no scripts capable to find the meta tag <meta name=”blogcatalog” content=”9BC8674180″ />
in my one and only and therefore first HEAD section.

The meta tag verification is somewhat buried on the policy page, it looks like BlogCatalog chases inbound links no matter what it costs. Dear BlogCatalog, in my case it costs reputation. You guys don’t really think that I send you a private message so that you can silently approve the declined sign up, don’t you? I’m pretty sure that you treat others the same way. Either dump the meta tag verification, or play by your very own rules.

It seems to me that BlogCatalog needs more professional advice from bright consultants (scroll down to Andy’s full disclosure).

Update: A few hours after publishing this post my submission got approved.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

11 comments Sebastian | Blogging, Crap

Upgrading from IIS/ASP to Apache/PHP

Posted on 11 December, 2007

Upgrade from Windows/IIS/ASP to Unix/Apache/PHP Once you’re sick of IIS/ASP maladies you want to upgrade your Web site to utilize standardized technologies and reliable OpenSource software. On an Apache Web server with PHP your .asp scripts won’t work, and you can’t run MS-Access “databases” and such stuff under Apache.

Here is my idea of a smoothly migration from IIS/ASP to Apache/PHP. Grab any Unix box from your hoster’s portfolio and start over.

(Recently I got a tiny IIS/ASP site about uses & abuses of link condoms and moved it to an Apache server. I’m well known for brutal IIS rants, but so far I didn’t discuss a way out of such a dilemma, so I thought blogging this move could be a good idea.)

I don’t want to make this piece too complex, so I skip database and code migration strategies. Read Mike Hillyer’s article Migrating from Microsoft Access/MS-SQL to MySQL, and try tools like ASP to PHP. (With my tiny link condom site I overwrote the ASP code with PHP statements in my primitive text editor.)

From an SEO perspective such an upgrade comes with pitfalls:

Changing file extensions from .asp to .php is not an option. We want to keep the number of unavoidable redirects as low as possible.
Default.asp is usually not configured as a valid default document under Apache, hence requests of http://example.com/ run into 404 errors.
Basic server name canonicalization routines (www vs. non-www) from ASP scripts are not convertible.
IIS-URIs are not case sensitive, that means that /Default.asp will 404 on Apache when the filename is /default.asp. Usually there are lowercase/uppercase issues with query string variables and values as well.
Most probably search engines have URL variants in their indexes, so we want to adapt their URL canonicalization, at least where possible.
HTML editors like Microsoft Visual Studio tend to duplicate the HTML code of templated page areas. Instead of editing menus or footers in all scripts we want to encapsulate them.
If the navigation makes use of relative links, we need to convert those to absolute URLs.
Error handling isn’t convertible. Improper error handling can cause decreasing search engine traffic.

Running /default.asp, /home.asp etc. as PHP scripts

When you upload an .asp file to an Apache Web server, most user agents can’t handle it. Browsers treat them as unknown file types and force downloads instead of rendering them. Next those files aren’t parsed for PHP statements, provided you’ve rewritten the ASP code already.

To tell Apache that .asp files are valid PHP scripts outputting X/HTML, add this code to your server config or your .htaccess file in the root: AddType text/html .asp AddHandler application/x-httpd-php .asp
The first line says that .asp files shall be treated as HTML documents, and should force the server to send a Content-Type: text/html HTTP header. The second line tells Apache that it must parse .asp files for PHP code.

Just in case the AddType statement above doesn’t produce a Content-Type: text/html header, here is another way to tell all user agents requesting .asp files from your server that the content type for .asp is text/html. If you’ve mod_headers available, you can accomplish that with this .htaccess code: <IfModule mod_headers.c> SetEnvIf Request_URI \.asp is_asp=is_asp Header set "Content-type" "text/html" env=is_asp Header set imagetoolbar "no" </IfModule>
(The imagetoolbar=no header tells IE to behave nicely; you can use this directive in a meta tag too.)
If for some reason mod_headers doesn’t work well with mod_setenvif, giving 500 error codes or so, then you can set the content-type with PHP too. Add this to a PHP script file which is included in all your scripts at the very top: @header("Content-type: text/html", TRUE);
Instead of “text/html” alone, you can define the character set too: “text/html; charset=UTF-8″

Sanitizing the home page URL by eliminating “default.asp”

Instead of slowing down Apache by defining just another default document name (DirectoryIndex index.html index.shtml index.htm index.php [...] default.asp), we get rid of “/default.asp” with this “/index.php” script: <?php @require("default.asp"); ?>
Now every request of http://example.com/ executes /index.php which includes /default.asp. This works with subdirectories too.

Just in case someone requests /default.asp directly (search engines keep forgotten links!), we perform a permanent redirect in .htaccess: Redirect 301 /default.asp http://example.com/ Redirect 301 /Default.asp http://example.com/

Converting the ASP code for server name canonicalization

If you find ASP canonicalization routines like <%@ Language=VBScript %> <% if strcomp(Request.ServerVariables("SERVER_NAME"), "www.example.com", vbCompareText) = 0 then Response.Clear Response.Status = "301 Moved Permanently" strNewUrl = Request.ServerVariables("URL") if instr(1,strNewUrl, "/default.asp", vbCompareText) > 0 then strNewUrl = replace(strNewUrl, "/Default.asp", "/") strNewUrl = replace(strNewUrl, "/default.asp", "/") end if if Request.QueryString <> "" then Response.AddHeader "Location","http://example.com" & strNewUrl & "?" & Request.QueryString else Response.AddHeader "Location","http://example.com" & strNewUrl end if Response.End end if %>
(or the other way round) at the top of all scripts, just select and delete. This .htaccess code works way better, because it takes care of other server name garbage too: RewriteEngine On RewriteCond %{HTTP_HOST} !^example\.com [NC] RewriteRule (.*) http://example.com/$1 [R=301,L]
(you need mod_rewrite, that’s usually enabled with the default configuration of Apache Web servers).

Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value

Probably a M$ developer didn’t read more than the scheme and server name chapter of the URL/URI standards, at least I’ve no better explanation for the fact that these clowns made the path and query string segment of URIs case-insensitive. (Ok, I have an idea, but nobody wants to read about M$ world domination plans.)

Just because -contrary to Web standards- M$ finds it funny to serve the same contents on request of /Home.asp as well as /home.ASP, such crap doesn’t fly on the World Wide Web. Search engines -and other Web services which store URLs- treat them as different URLs, and consider everything except one version duplicate content.

Creating hyperlinks in HTML editors by picking the script files from the Windows Explorer can result in HREF values like “/Script.asp”, although the file itself is stored with an all-lowercase name, and the FTP client uploads “/script.asp” to the Web server. There are more ways to fuck up file names with improper use of (leading) uppercase characters. Typos like that are somewhat undetectable with IIS, because the developer surfing the site won’t get 404-Not found responses.

Don’t misunderstand me, you’re free to camel-case file names for improved readability, but then make sure that the file system’s notation matches the URIs in HREF/SRC values. (Of course hyphened file names like “buy-cheap-viagra.asp” top the CamelCased version “BuyCheapViagra.asp” when it comes to search engine rankings, but don’t freak out about keywords in URLs, that’s ranking factor #202 or so.)

Technically spoken, converting all file names, variable names and values as well to all-lowercase is the simplest solution. This way it’s quite easy to 301-redirect all invalid requests to the canonical URLs.

However, each redirect puts search engine traffic at risk. Not all search engines process 301 redirects as they should (MSN Live Search for example doesn’t follow permanent redirects and doesn’t pass the reputation earned by the old URL over to the new URL). So if you’ve good SERP positions for “misspelled” URLs, it might make sense to stick with ugly directory/file names. Check your search engine rankings, perform [site:example.com] search queries on all major engines, and read the SERP referrer reports from the old site’s server stats to identify all URLs you don’t want to redirect. By the way, the link reports in Google’s Webmaster Console and Yahoo’s Site Explorer reveal invalid URLs with (internal as well as external) inbound links too.

Whatever strategy fits your needs best, you’ve to call a script handling invalid URLs from your .htaccess file. You can do that with the ErrorDocument directive: ErrorDocument 404 /404handler.php
That’s safe with static URLs without parameters and should work with dynamic URIs too. When you -in some cases- deal with query strings and/or virtual URIs, the .htaccess code becomes more complex, but handling virtual paths and query string parameters in the PHP scripts might be easier: <IfModule mod_rewrite.c> RewriteEngine On RewriteBase / RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule . /404handler.php [L] </IfModule>
In both cases Apache will process /404handler.php if the requested URI is invalid, that is if the path segment (/directory/file.extension) points to a file that doesn’t exist.

And here is the PHP script /404handler.php:
View|hide PHP code. (If you’ve disabled JavaScript you can’t grab the PHP source code!)<?php // 404handler.php // called from .htaccess if the requested path doesn’t exist $thisFileName = "404handler.php"; // change this $canonicalScheme = "http://"; $canonicalServer = "example.com"; // change this $errorPageUri = "/error.asp"; // change this $documentRoot = $_SERVER["DOCUMENT_ROOT"]; $requestUri = $_SERVER["REQUEST_URI"]; $canonicalUri = ""; $requestedUrl = $canonicalScheme .$canonicalServer .$requestUri; $canonicalUrl = ""; $url = parse_url($requestedUrl); $requestPath = $url["path"]; $includeScript = ""; $queryString = $url["query"]; // keep misspelled URIs with nice search engine rankings if ("$requestPath" == "/Sample.asp") { // change this $includeScript = $documentRoot ."/sample.asp"; // change this } // … if (!empty($includeScript)) { @header("HTTP/1.1 200 OK", TRUE, 200); @include($includeScript); exit; } // if the lowercase version exists, redirect to it $lcPath = strtolower($url["path"]); $lcFile = $documentRoot .$lcPath; if (file_exists($lcFile) && !stristr($requestUri,$thisFileName)) { $canonicalUrl = $canonicalScheme .$canonicalServer .$lcPath; if ($queryString) { $canonicalUrl .= "?" .$queryString; } if ($url["fragment"]) { $canonicalUrl .= "#" .$url["fragment"]; } } if (!empty($canonicalUrl)) { @header("HTTP/1.1 301 Moved Permanently", TRUE, 301); @header("Location: $canonicalUrl"); exit; } // serve the 404 error page @header("HTTP/1.1 404 Not found", TRUE, 404); @include($documentRoot .$errorPageUri); exit; ?>
(Edit the values in all lines marked with “// change this”.)

This script doesn’t handle case issues with query string variables and values. Query string canonicalization must be developed for each individual site. Also, capturing misspelled URLs with nice search engine rankings should be implemented utilizing a database table when you’ve more than a dozen or so.

Lets see what the /404handler.php script does with requests of non-existing files.

First we test the requested URI for invalid URLs which are nicely ranked at search engines. We don’t care much about duplicate content issues when the engines deliver targeted traffic. Here is an example (which admittedly doesn’t rank for anything but illustrates the functionality): both /sample.asp as well as /Sample.asp deliver the same content, although there’s no /Sample.asp script. Of course a better procedure would be renaming /sample.asp to /Sample.asp, permanently redirecting /sample.asp to /Sample.asp in .htaccess, and changing all internal links accordinly.

Next we lookup the all lowercase version of the requested path. If such a file exists, we perform a permanent redirect to it. Example: /About.asp 301-redirects to /about.asp, which is the file that exists.

Finally, if everything we tried to find a suitable URI for the actual request failed, we send the client a 404 error code and output the error page. Example: /gimme404.asp doesn’t exist, hence /404handler.php responds with a 404-Not Found header and displays /error.asp, but /error.asp directly requested responds with a 200-OK.

You can easily refine the script with other algorithms and mappings to adapt its somewhat primitive functionality to your project’s needs.

Tweaking code for future maintenance

Legacy code comes with repetition, redundancy and duplication caused by developers who love copy+paste respectively copy+paste+modify, or Web design software that generates static files from templates. Even when you’re not willing to do a complete revamp by shoving your contents into a CMS, you must replace the ASP code anyway, what gives you the opportunity to encapsulate all templated page areas.

Say your design tool created a bunch of .asp files which all contain the same sidebars, headers and footers. When you move those files to your new server, create PHP include files from each templated page area, then replace the duplicated HTML code with <?php @include("header.php"); ?>, <?php @include("sidebar.php"); ?>, <?php @include("footer.php"); ?> and so on. Note that when you’ve HTML code in a PHP include file, you must add <?php ?> before the first line of HTML code or contents in included files. Also, leading spaces, empty lines and such which don’t hurt in HTML, can result in errors with PHP statements like header(), because those fail when the server has sent anything to the user agent (even a single space, new line or tab is too much).

It’s a good idea to use PHP scripts that are included at the very top and bottom of all scripts, even when you currently have no idea what to put into those. Trust me and create top.php and bottom.php, then add the calls (<?php @include("top.php"); ?> […] <?php @include("bottom.php"); ?>) to all scripts. Tomorrow you’ll write a generic routine that you must have in all scripts, and you’ll happily do that in top.php. The day after tomorrow you’ll paste the GoogleAnalytics tracking code into bottom.php. With complex sites you need more hooks.

Using absolute URLs on different systems

Another weak point is the use of relative URIs in links, image sources or references to feeds or external scripts. The lame excuse of most developers is that they need to test the site on their local machine, and that doesn’t work with absolute URLs. Crap. Of course it works. The first statement in top.php is @require($_SERVER["SERVER_NAME"] .".php");
This way you can set the base URL for each environment and your code runs everywhere. For development purposes on a subdomain you’ve a “dev.example.com.php” include file, on the production system example.com the file name resolves to “www.example.com.php”: <?php $baseUrl = “http://example.com”; ?>
Then the menu in sidebar.php looks like: <?php $classVMenu = "vmenu"; print " <img src=\"$baseUrl/vmenuheader.png\" width=\"128\" height=\"16\" alt=\"MENU\" /> <ul> <li><a class=\"$classVMenu\" href=\"$baseUrl/\">Home</a></li> <li><a class=\"$classVMenu\" href=\"$baseUrl/contact.asp\">Contact</a></li> <li><a class=\"$classVMenu\" href=\"$baseUrl/sitemap.asp\">Sitemap</a></li> … </ul> "; ?>
Mixing X/HTML with server sided scripting languages is fault-prone and makes maintenance a nightmare. Don’t make the same mistake as WordPress. Avoid crap like that: <li><a class="<?php print $classVMenu; ?>" href="<?php print $baseUrl; ?>/contact.asp"></a></li>

Error handling

I refuse to discuss IIS error handling. On Apache servers you simply put ErrorDocument directives in your root’s .htaccess file: ErrorDocument 401 /get-the-fuck-outta-here.asp ErrorDocument 403 /get-the-fudge-outta-here.asp ErrorDocument 404 /404handler.php ErrorDocument 410 /410-gone-forever.asp ErrorDocument 503 /410-down-for-maintenance.asp # … Options -Indexes
Then create neat pages for each HTTP response code which explain the error to the visitor and offer alternatives. Of course you can handle all response codes with one single script: ErrorDocument 401 /error.php?errno=401 ErrorDocument 403 /error.php?errno=403 ErrorDocument 404 /404handler.php ErrorDocument 410 /error.php?errno=410 ErrorDocument 503 /error.php?errno=503 # … Options -Indexes
Note that relative URLs in pages or scripts called by ErrorDocument directives don’t work. Don’t use absolute URLs in ErrorDocument directives itself, because this way you get 302 response codes for 404 errors and crap like that. If you cover the 401 response code with a fully qualified URL, your server will explode. (Ok, it will just hang but that’s bad enough.) For more information please read my pamphlet Why error handling is important.

Last but not least create a robots.txt file in the root. If you’ve nothing to hide from search engine crawlers, this one will suffice: User-agent: * Disallow: Allow: /

I’m aware that this tiny guide can’t cover everything. It should give you an idea of the pitfalls and possible solutions. If you’re somewhat code-savvy my code snippets will get you started, but hire an expert when you plan to migrate a large site. And don’t view the source code of link-condom.com pages where I didn’t implement all tips from this tutorial.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

22 comments Sebastian | 404grabber, Duplicate Content, Redirects, Web development, Copy+Paste-Penalties, .htaccess, IIS, SEO

MSN spam to continue says the Live Search Blog

Posted on 5 December, 2007

It seems MSN/LiveSearch has tweaked their rogue bots and continues to spam innocent Web sites just in case they could cloak. I see a rant coming, but first the facts and news.

Since August 2007 MSN runs a bogus bot faking a human visitor coming from a search results page, that follows their crawler. This spambot downloads everything from a page, that is images and other objects, external CSS/JS files, and ad blocks rendering even contextual advertising from Google and Yahoo. It fakes MSN SERP referrers diluting the search term stats with generic and unrelated keywords. Webmasters running non-adult sites wondered why a database tutorial suddenly ranks for [oral sex] and why MSN sends visitors searching for [MILF pix] to a teenager’s diary. Webmasters assumed that MSN is after deceitful cloaking, and laughed out loud because their webspam detection method was that primitive and easy to fool.

Now MSN admits all their sins -except the launch of a porn affiliate program- and posted a vague excuse on their Webmaster Blog telling the world that they discovered the evil cloakers and their index is somewhat spam free now. Donna has chatted with the MSN spam team about their spambot and reports that blocking its IP addresses is a bad idea, even for sites that don’t cloak. Vanessa Fox summarized MSN’s poor man’s cloaking detection at Search Engine Land:

And one has to wonder how effective methods like this really are. Those savvy enough to cloak may be able to cloak for this new cloaker detection bot as well.

They say that they no longer spam sites that don’t cloak, but reverse this statement telling Donna

we need to be able to identify the legitimate and illegitimate content

and Vanessa

sites that are cloaking may continue to see some amount of traffic from this bot. This tool crawls sites throughout the web — both those that cloak and those that don’t — but those not found to be cloaking won’t continue to see traffic.

Here is an excerpt from yesterdays referrer log of a site that does not cloak, and never did: http://search.live.com/results.aspx?q=webmaster&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=smart&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=search&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=progress&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=google&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=domain&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=database&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=content&mrt=en-us&FORM=LIVSOP http://search.live.com/results.aspx?q=business&mrt=en-us&FORM=LIVSOP
Why can’t the MSN dudes tell the truth, not even when they apologize?

Another lie is “we obey robots.txt”. Of course the spambot doesn’t request it to bypass bot traps, but according to MSN it uses a copy served to the LiveSearch crawler “msnbot”:

Yes, this robot does follow the robots.txt file. The reason you don’t see it download it, is that we use a fresh copy from our index. The tool does respect the robots.txt the same way that MSNBot does with a caveat; the tool behaves like a browser and some files that a crawler would ignore will be viewed just like real user would.

In reality, it doesn’t help to block CSS/JS files or images in robots.txt, because MSN’s spambot will download them anyway. The long winded statement above translates to “We promise to obey robots.txt, but if it fits our needs we’ll ignore it”.

Well, MSN is not the only search engine running stealthy bots to detect cloaking, but they aren’t clever enough to do it in a less abusive and detectable way.

Their insane spambot led all cloaking specialists out there to their not that obvious spam detection methods. They may have caught a few cloaking sites, but considering the short life cycle of Webspam on throwaway domains they shot themselves in both feet. What they really have achieved is that the cloaking scripts are MSN spam detection immune now.

Was it really necessary to annoy and defraud the whole Webmaster community and to burn huge amounts of bandwidth just to catch a few cloakers who launched new scripts on new throwaway domains hours after the first appearance of the MSN spam bot?

Can cosmetic changes with regard to their useless spam activities restore MSN’s lost reputation? I doubt it. They’ve admitted their miserable failure five months too late. Instead of dumping the spambot, they announce that they’ll spam away for the foreseeable future. How silly is that? I thought Microsoft is somewhat profit orientated, why do they burn their and our money with such amateurish projects?

Besides all this crap MSN has good news too. Microsoft Live Search told Search Engine Roundtable that they’ll spam our sites with keywords related to our content from now on, at least they’ll try it. And they have a forum and a contact form to gather complaints. Crap on, so much bureaucratic efforts to administer their ridiculous spam fighting funeral. They’d better build a search engine that actually sends human traffic.

Share/bookmark this: del.icio.us • Google • ma.gnolia • Mixx • Netscape • reddit • Sphinn • Squidoo • StumbleUpon • Yahoo MyWeb
Subscribe to

Entries

Comments

All Comments

9 comments Sebastian | Spoofing, MSN, Search Quality, Webspam, Crap, Spam, Cloaking

Sebastian’s Pamphlets

Monthly archive: December, 2007

Vote Now: Rubber Chicken Award 2007 for the dullest and most tedious search blog post

Ping the hell out of Technorati’s reputation algo

Results:

No more RSS feeds in Google’s search results

Nominate a red crab in the 2007 Search Blog Awards!

BlogCatalog needs professional help

Upgrading from IIS/ASP to Apache/PHP

Running /default.asp, /home.asp etc. as PHP scripts

Sanitizing the home page URL by eliminating “default.asp”

Converting the ASP code for server name canonicalization

Fixing case issues like /script.asp?id=value vs. /Script.asp?ID=Value

Tweaking code for future maintenance

Using absolute URLs on different systems

Error handling

MSN spam to continue says the Live Search Blog

Categories

Monthly Archives

Links

RSS Feeds