Google pulls CIA data

Bollocks devoted to Daniel Brandt

Playing with the 71 new search keywords from Google, I noticed that many search queries get answered with CIA data. Try “national holiday” plus Canada, Germany, France, Italy and so on; in all cases you get directed to the CIA World Factbook.




How Google’s Web Spam Team finds your link scheme

Natural Search Blog has a nice piece reporting that Matt’s team makes use of a proprietary tool to identify webspam trying to manipulate Google’s PageRank.

Ever wondered why Google catches PR-boosting service scams in no time?




Why proper error handling is important

Misconfigured servers can prevent search engines from crawling and indexing. I admit that’s old news. However, standard setups and code copied from low-quality resources are underestimated (but very popular) points of failure. According to Google, a missing robots.txt file combined with amateurish error handling can result in invisibility on Google’s SERPs. That’s a very common setup, by the way.

Googler Jonathon Simon said:

This way [correct setup] when the Google crawler or other search engine checks for a robots.txt file, they get a 200 response if the file is found and a 404 response if it is not found. If they get a 200 response for both cases then it is ambiguous if your site has blocked search engines or not, reducing the likelihood your site will be fully crawled and indexed.

That’s a very carefully written warning, so I’ll try to rephrase the message between the lines:

If you have no robots.txt and your server responds “Ok” (or answers the robots.txt request with a 302 followed by a 200 for the error page) when Googlebot tries to fetch it, Googlebot might not be willing to crawl your stuff further, hence your pages will not make it into Google’s search index.
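If you want to see what your own box actually answers, a few lines of Python will report the final status codes a crawler would get. This is just a sketch assuming Python 3’s standard library; example.com and the test path are placeholders for your own domain and some URL that doesn’t exist.

# Minimal sketch: check the status codes a server really returns.
# Assumes Python 3; replace example.com with your own domain.
import urllib.request
import urllib.error

def final_status(url):
    # urlopen follows redirects, so a soft 302-to-error-page chain
    # shows up here as the final code, usually 200
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # proper 4xx/5xx answers land here

for path in ("/robots.txt", "/there-is-no-such-page"):
    print(path, final_status("http://example.com" + path))

A healthy server returns 200 for an existing robots.txt (or a clean 404 if you don’t have one yet) and 404 for the nonexistent path; 200 for both is exactly the ambiguous case described above.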

If you don’t suffer from IIS (Windows hosting is a horrible nightmare coming with more pitfalls than countable objects in the universe: go find a reliable host), here is a bullet-proof setup.

If you don’t have a robots.txt file yet, create one and upload it today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file types or areas of your site, refer to the robots.txt manual.
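For instance, if you wanted to keep a members-only area and a search script out of the engines while leaving everything else crawlable, the file could look like this (the paths are made up, so adapt them to your own site):

User-agent: *
Disallow: /private/
Disallow: /search.php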

Next, look at the .htaccess file in your server’s Web root directory. If your FTP client doesn’t show it, add “-a” to “external mask” in the settings and reconnect.

If you find complete URLs in lines starting with “ErrorDocument”, your error handling is screwed up. What happens is that your server does a soft redirect to the given URL, which probably responds with “200-Ok”, and the actual error code gets lost in cyberspace. Sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM-XT hosting Google.com, and all other error directives pointing to absolute URLs produce crap as well. Here is a well-formed .htaccess sample:

ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
Options -Indexes
<Files ".ht*">
deny from all
</Files>
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]

With “ErrorDocument” directives you can capture other clumsiness as well, for example 500 errors with /server-too-buzzy.html or so. Or make the error handling more comfortable using /error.php?errno=[insert err#]. In any case avoid relative URLs (src attributes in IMG elements, CSS/feed links, href attributes of A elements …) on all error pages. You can test actual HTTP response codes with online header checkers, or with a quick script like the one sketched above.
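If you go the /error.php route, the corresponding directives are just more ErrorDocument lines. A sketch, assuming a hypothetical error.php script that reads the errno parameter; depending on your PHP setup, that script may need to send the original status code itself:

# hypothetical examples; error.php is your own script
ErrorDocument 500 /error.php?errno=500
ErrorDocument 503 /error.php?errno=503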

The other statements above do a few neat things as well. Options -Indexes disallows directory browsing, the <Files> block makes sure that nobody can read your server directives, and the last three lines redirect requests for non-canonical server names to your canonical server address.

.htaccess is a plain ASCII file; it can get screwed up when you upload it in binary mode or edit it with a word processor. It’s best to edit it with a plain-text editor (vi, notepad) as htaccess.txt on your local machine (most FTP clients choose ASCII mode for text files) and rename it to “.htaccess” on the server. Keep in mind that file names are case sensitive.


Getting Help and Answers from Google

For webmasters and publishers who don’t have Googlers on their IM buddy list or in their email address book, Google has opened a communication channel for the masses. Google’s Webmaster Blog is open for webmaster comments, and Googlers answer crawling and indexing related questions in Google’s Webmaster Help Central. Due to the disadvantages of snowboarding, participation of Googlers in the forum has slowed down a bit lately, but from what I can tell things are changing for the better.

As great as all these honest efforts to communicate with webmasters are, large user groups come with disadvantages like trolling and more noise than signal. So I’ve tried to find ways to make Google’s Webmaster Forums more useful. Since the Google Groups platform doesn’t offer RSS feeds for search results, I tried to track particular topics and authors with Google’s blog search instead. This experiment turned out to be a miserable failure.

Tracking discussions via web search is way too slow because time to index runs to a couple of days, not minutes or hours as with blog search or news search. The RSS feeds provided contain all the noise and trolling I don’t want to see, and they don’t even come with useful author tags, so I needed a simple and stupid procedure to filter RSS feeds with Google Reader. I thought I’d use Yahoo Pipes to create the filters, and this worked just fine as long as I viewed the RSS output as source code or formatted by Yahoo. Seems today is my miserable-failure day: Google Reader told me my famous piped feeds contain zero items, no title, nor all the neat stuff I’d seen seconds ago in the feed’s source. Aaaahhhrrrrgggg … So I’m going back to tracking threads (missing lots of valuable posts due to senseless thread titles or topic changes within threads) and profiles, for example Adam Lasnik (Google’s Search Evangelist), John Mueller (Softplus), Jonathan Simon (Google), Maile Ohye (Google), Thu Tu (Google), and Vanessa Fox (Google). Google is awesome, not perfect but still awesome. It seems my intention (constructive criticism) got obscured by my sometimes weird sense of humor and my preference for snarky irony and exaggeration to bring a point home.

Update July/05/2007: Google has fixed the broken RSS feeds.


Google Blog Search Banned Legit Webmaster Forum

I’ve been able to get all sorts of non-blog stuff onto the SERPs of Google’s blog search in the past. However, my attempt to get content hosted by Google into blog search is best described as a miserable failure. Although Google Blog Search BETA delivers results from all kinds of forums, it obviously can’t deal with threaded content from a source which recently got rid of its BETA status.

First I pinged blog search, submitted feeds, and linked to the threads both from here and from a feed that gets regularly fetched for blog search. No results. There are no robots.txt barriers or noindex tags, just badly malformed code, but Google’s bot can digest improperly closed alternate links pointing to an RSS feed … it drove me nuts. Must be a ban or at least a heavy troll penalty, I thought, so I went to Yahoo, masked the feed URLs, and submitted again, but to no avail.

Try it for yourself: submit a feed to Google Blog Search, then take a somewhat unique thread title and do a blog search. Got zilch too? Try a web search to double-check that the content is crawlable. It is. Conclusion? Google banned its very own Google Groups.

Too sad; poor PageRank addicts running blog searches will miss out on tidbits like this quote from Google’s Adam Lasnik, asked why URLs blocked from crawlers show toolbar-PR:

As for the PR showing… it’s my understanding that the toolbar is using non-private info (PR data from other pages in that domain) to extrapolate/infer/guess a PR for that page :).


Code Monkey Very Simple Man/Woman

Rcjordan over at Threadwatch pointed me to a nice song that perfectly explains rumors like “Google’s verification tags get you into supplemental hell” and thoughtless SEO theories like “self-closing meta tags in HTML 4.x documents and uppercase element/attribute names in XHTML documents prevent search engine crawlers from indexing”. You don’t believe such crappy “advice” can make it to the headlines? Just wait for an appropriate thread at your preferred SEO forum to get picked up by a popular but technically challenged blogger. This wacky hogwash is the most popular lame excuse for MSSA issues (aka “Google is broke coz my site sitting at top-10 positions since the stone age disappeared all of a sudden”) at Google’s very own Webmaster Central.

Here is a quote:

“The robot [search engine crawler] HAS to read syntactically … And I opt for this explanation exactly because it makes sense to me [the code monkey] that robots have to be diligent in crawling syntactically in order to do a good job of indexing … The old robots [Googlebot 2.x] did not actually parse syntactically - they sucked in all characters and sifted them into keywords - text but also tags and JS content if the syntax was broken, they didn’t discriminate. Most websites were originally indexed that way. The new robots [Mozilla-compatible Googlebot] parse with syntax in mind. If it’s badly broken (and improper closing of a tag in the head section of a non-xhtml dtd is badly broken), they stop or skip over everything else until they find their bearings again. With a broken head that happens at the </html> tag or thereabouts”.

Basically this means that the crawler ignores the remaining code in HEAD, or even hops to the end of the document without reading the page’s contents.

In reality, search engine crawlers are pretty robust and fault tolerant, designed to eat and digest the worst code one can provide. These theories belong in the same drawer as the myths about “Google’s Sandbox”.

Just hire code monkeys for code monkey tasks, and SEOs for everything else ;)


Finally dumping M$-Office…

… in favour of “Google Office”, err, Google Apps.




Hapless Structures and Weak Linkage

Michael Martinez over at SEO-Theory (moved!) has a nice write-up on how to get crawled and indexed. The post, titled “Search engine love: now they crawl me, now they don’t”, discusses the importance of internal linkage, PageRank distribution, and Google’s recent architectural changes, topics which are “hot” in Google’s Webmaster Help Center, where I hang out every now and then. I thought I’d blog Michael’s nice essay as a sort of multi-link bookmark to make link drops easier, so here is some of my stuff related to crawling and indexing:

About Google’s Toolbar-PageRank
High PageRank leads to frequent crawling, but ignore the green pixels nonetheless.

The Top-5 Methods to Attract Search Engine Spiders
Get deep links to great content.

Supporting search engine crawling
The syntax of a search engine friendly Web site.

Web Site Structuring
Do’s and don’ts on information architectures.

Optimizing Web Site Navigation
Tweak your UI for users to make it crawler friendly.

Linking is All About Popularity and Authority
LOL: Link out loud.

Related information


Interested in buying a text link

Today I give up on answering emails like this one:

Hello,

First of all I would like to introduce my company as one of the best web hosting service provider from [country] named [link]. We are in the hosting business since 2004 and have more than 3000 satisfied customers.

We are having PR -6 and an alexa ranking of 63,697

We are interested to purchase a link at your site, please provide us with a suitable quotation.

Waiting for your kind reply.

Regards,
[Name, Company …]

Besides the fact that a page claiming a PageRank of minus six is most probably not the kind of neighborhood I’d tend to link out to, it’s a kinda stupid attempt.

Not only is the page where the contact link was clicked in no way related to web hosting services (it just triggers a few green pixels in the Google toolbar); each and every page on this topic also has a link leading to my take on paid links, which does not encourage link monkey business, so to say.

My usual reply to such emails was “Thanks for writing, you can buy a nofollow’ed link marked as advertising for as low as [tiny monthly fee] when you suggest a page on my site which is relevant to yours and I like what you provide to your visitors/users”, plus an explanation of the link condom. No takers.

The message above is from a clown abusing my contact form today, so I guess it’s OK to quote it. It is, however, symptomatic: there are lots of folks out there who still believe that fooling the engines is that simple. I admit it can be done, but I’m with Eric Ward, who says it’s not worth it.


Priceless SEO Advice

Just stumbled upon: if you are too stupid to use a computer, you might try giving SEO advice. Best business plan ever, for idiots ;)

My favourite:

Q: Why are SEO Consultants too expensive for webmasters?
A: I personally used a firm that did a 250,000 site submissions for my site, it worked great.

