Creating Blog Vibes Since July 2002
:: Usability, SEO and Web Design ::

Intriguing blab about usability, seo/sem, web dev, search engines, Cre8asiteforums and Internet-life stuff.

:: Tuesday, August 20, 2002 ::

Robots. Spiders. And Google Bears. Oh My!

(The following is reprinted by permission of the author. Originally posted to the Cre8asite Forum.)

Okey dokey. The robots.txt file is a text file that contains rules to exclude spiders from getting certain named files and the contents of certain named directories. There are no rules to tell spiders what they CAN do - just what they CANNOT do. We probably all know about the file and what it's for, but if you are not familiar with it, read http://www.searchengineworld.com/robots/robots_tutorial.htm. Ignore the second sentence on that page. It's wrong. The file tells spiders what they cannot index, not what they "can download".
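To make the exclusion rules concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the www.example.com host and the "ExampleBot" spider name are all invented for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: every spider ("*") is asked to stay
# out of one directory and away from one named file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/notes.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the rules do not exclude `agent` from `path`."""
    # www.example.com is a placeholder host, assumed for this sketch.
    return parser.can_fetch(agent, "http://www.example.com" + path)
```

Anything not named in a Disallow line is fair game, so allowed("/index.html") is true while allowed("/cgi-bin/search.pl") is false.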

The robots.txt protocol, which is voluntary, tells spiders what they cannot index. It doesn't mean that they cannot download the files and look at them if they want to. In fact, robots.txt can be used as a signpost to files and directories that spiders would otherwise have no way of knowing about. I mention this because it is possible for a spam-hunting spider to examine the files that we didn't want it to see. But that's another story.
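The signpost effect is easy to demonstrate. The sketch below, with a hypothetical helper name (disallowed_paths) and made-up paths, simply collects every path named in a Disallow line, which is all a spider that ignores the protocol would need to do.

```python
def disallowed_paths(robots_txt):
    """Collect every path named in a Disallow line of a robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        value = value.strip()
        # An empty Disallow value excludes nothing, so it is skipped.
        if sep and field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


# Hypothetical file contents, invented for illustration.
sample = "User-agent: *\nDisallow: /secret/\nDisallow: /old/drafts.html\n"
hidden = disallowed_paths(sample)
```

Here hidden ends up holding exactly the two locations the webmaster was hoping to keep out of sight.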

The robots.txt file must be contained in the site's root directory, as that is the only place that spiders look for it. It's up to the webmaster to create the file. Servers don't do it. It isn't something that is automatically included with hosting.
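Because the file only ever lives in the root, a spider can work out where to ask for it from any page address on the site. A small sketch, assuming a hypothetical helper name (robots_url) and a placeholder host:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the root-level robots.txt address for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# www.example.com is a placeholder, not a site mentioned in the post.
location = robots_url("http://www.example.com/articles/2002/spiders.html?page=2")
```

A robots.txt placed in /articles/ or any other subdirectory would never be requested.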

The spiders of major engines will always request the file in case it contains any rules for them. Many, if not most, sites don't make use of the file at all and, therefore, it doesn't exist. Many servers will return the default 403 (Forbidden) page, rather than the usual 404 (Not Found), when a requested page doesn't exist. Since the spiders are requesting a nonexistent page, they, like everyone else, will get the 403 page.

Until recently, Google's spider would not continue to crawl any website that returned a 403 instead of the robots.txt file. Now it does crawl those sites. There was never any reason for Google not to crawl sites when the robots.txt file didn't exist, but for some reason they made the decision to do it that way.
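The change can be pictured as a single decision on the status code that comes back for the robots.txt request. This is only a sketch of the behaviour as described here, not Google's actual logic; the function name, the lenient flag and the handling of any other status code are assumptions made for illustration.

```python
def may_crawl_site(robots_status, lenient=True):
    """Decide whether a spider carries on after requesting robots.txt.

    lenient=False models the old behaviour described above (a 403 stops
    the crawl); lenient=True models the newer behaviour (a 403 is treated
    like a missing file).
    """
    if robots_status == 200:
        return True      # file exists: crawl, subject to its rules
    if robots_status == 404:
        return True      # no file, so nothing is excluded
    if robots_status == 403:
        return lenient   # the case that changed
    # Assumption for this sketch: be cautious about anything else.
    return False
```

With lenient=False, a site whose server answers 403 for a missing robots.txt is never crawled beyond that first request, which matches what the affected sites saw.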

When Kim asked why her client's sites were not being crawled, I said that some of mine had been waiting from 6 to 12 months. Google had sniffed at them by getting the index page and the robots.txt file but never crawled any further. I honestly thought that there was some sort of semi-ban on the IP address.

Now it all makes sense. My sites were fully crawled a couple of months ago and Kim reported that her client's sites had also been crawled in the same cycle. A friend of mine (Grace) told me that exactly the same thing happened with some of her sites.

Why on earth they would choose not to crawl sites where there was no robots.txt file is beyond reason. But, thankfully, they do crawl them now. It would have been nice to know about it in the past though.

Posted by Phil Craven of WebWorkshop.net

:: posted by Kim Krause Berg on 8/20/2002 08:57:12 PM
