The format (syntax) of the robots.txt file has to be followed exactly. The file consists of records, and each record has two parts: first a "User-agent" line, then one or more "Disallow" lines.
Syntax is <field> ":" <value>
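To make the <field> ":" <value> layout concrete, here is a minimal Python sketch that splits one line into its two parts. The function name parse_line is my own invention for illustration, not part of any library, and it only treats a line as a comment when it starts with #.

```python
def parse_line(line):
    """Split one robots.txt line into (field, value).

    Returns None for blank lines, comment lines, and lines without ':'.
    parse_line is a hypothetical helper written for this tutorial.
    """
    text = line.strip()
    if not text or text.startswith("#") or ":" not in text:
        return None
    field, value = text.split(":", 1)
    # Field names are case-insensitive, so normalise them; values keep their case.
    return field.strip().lower(), value.strip()


print(parse_line("User-agent: Googlebot"))
print(parse_line("Disallow: /cgi-bin/"))
```

Splitting on the first ":" only means a value that itself contains a colon is kept whole.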
You should create the file with UNIX line endings. Good text editors or one of the Linux line editors work.
***WARNING***
Do not use your HTML editor to create the robots.txt file unless it has a "text mode" to edit in. Notepad can be used if your FTP client has
First we talk about the "User-agent" line.
We have all seen "Googlebot", who hangs out here enough to be a Moderator, so we use it as an example.
The User-agent line specifies the bot that this record is for:
User-agent: Googlebot
or you can specify all bots/spiders using the wildcard "*":
User-agent: *
There are lists of all known User-agents on the net and you can check your log files to find User-agents that hit your site.
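If you want to pull the User-agents out of your own log files, something like the following Python sketch may help. It assumes the common "combined" access-log format, where the User-agent is the last quoted field on each line; your server may log differently. The sample lines, the name ExampleBot and the function count_user_agents are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how often each User-agent string appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines; in practice you would iterate over your real log file.
sample_lines = [
    '192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Googlebot/2.1"',
    '192.0.2.11 - - [01/Jan/2024:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [01/Jan/2024:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]

agents = count_user_agents(sample_lines)
for agent, hits in agents.most_common():
    print(hits, agent)
```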
Now comes the second part of the record, the "Disallow" line.
A record may have one or more Disallow lines depending on how much restriction you need. It's called "Disallow" because all useragents are allowed to harvest all files and folders unless you specify otherwise. This example tells useragents that they are not allowed to harvest the file my_email.html:
Disallow: /my_email.html
To tell the useragents that a directory is off limits, use this format:
Disallow: /cgi-bin/
Now useragents (conforming bots and spiders) will stay out of that directory.
Can I use wildcards when setting up the rules? In a way you can, because the value is matched as a prefix of the path. If you use:
Disallow: /private
This would block /private.html and /private/index.html and any other files residing in the /private/ folder. Don't put "Disallow:" alone on a line, because the empty value is interpreted as "good to go, no restrictions", unless that is what you are trying to do.
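You can try the prefix behaviour yourself with the robots.txt parser in Python's standard library (urllib.robotparser). This is only a sketch of how that one parser reads the rule; other bots may behave differently. The bot name "AnyBot" and the path /public.html are made up for the test.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
# Feed the rules in directly instead of downloading a file.
rules.parse([
    "User-agent: *",
    "Disallow: /private",
])

# "AnyBot" is a made-up name; it falls under the "*" record.
for path in ["/private.html", "/private/index.html", "/public.html"]:
    print(path, rules.can_fetch("AnyBot", path))
```

The first two paths start with /private, so this parser reports them as blocked, while /public.html is allowed.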
This brings us to the next step. As "professional" web designers we put comments on everything we do, so we can later figure out what was going on when we wrote the file. You can put comments in the robots.txt file by starting the comment with #. One word of caution: put the comment on a line by itself, because a lot of useragents don't handle white-space correctly. The line:
"Disallow: /images/ #don’t need to see my wedding pictures"
will not stop all useragents from harvesting from the /images/ directory, because they read it as:
"Disallow: /images/#don’tneedtoseemyweddingpictures"
and will go on into the /images/ directory. So put the comment on the next line after the Disallow statement.
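If you already have a file with comments tacked onto the end of rules, a small script can move them onto their own lines. Here is a minimal Python sketch; the function name split_trailing_comments is my own, and it assumes that a # never appears inside one of your paths.

```python
def split_trailing_comments(lines):
    """Move any trailing '#' comment onto its own line after the rule.

    Assumption: '#' is never part of a path in the file.
    """
    result = []
    for line in lines:
        rule, sep, comment = line.partition("#")
        if sep and rule.strip():
            # The line holds a rule followed by a comment: split it in two.
            result.append(rule.rstrip())
            result.append("#" + comment)
        else:
            # Pure rule, pure comment, or blank line: keep as it is.
            result.append(line)
    return result


fixed = split_trailing_comments([
    "User-agent: *",
    "Disallow: /images/ #no wedding pictures",
])
print("\n".join(fixed))
```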
A couple of examples to round out this tutorial:
In my web-site here at Astahost I use the simplest form of the robots.txt file because I use the web-site for testing concepts and don't need it harvested. You can see what it looks like here: http://www.ngc.astahost.com/robots.txt
CODE
#First example
User-agent: *
Disallow: /
#This keeps all robots out (the setting I use now)
#Second example
User-agent: *
Disallow:
#This allows the useragent to harvest all files and folders.
#Third example
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /my_mail.html
#This will block all bots from the files and folders inside
#the images and cgi-bin folder. The file my_mail.html will
#not be harvested.
#Fourth example
User-agent: Anthill
Disallow: /pricelist/
#This will block the Anthill useragent from accessing your
#pricelist folder and all files in that directory. (Anthill
#is used to gather price-information automatically from online
#stores. Support for international versions.)
#Fifth example
User-agent: linklooker
Disallow: /
#This is a new bot that is not registered, so who knows what
#the data is collected for.
List of known bots and spiders with detailed information and description.
If you are allowing some bots but not all, make sure you list the bots you allow first and then the deny-all record.
CODE
#Allow googlebot, msnbot, and askjeeves to harvest all files and folders.
User-agent: googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: askjeeves
Disallow:

#The rest of the bots and spiders are blocked
User-agent: *
Disallow: /
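To see how an allow-some, block-the-rest file plays out, you can run it through Python's standard-library parser (urllib.robotparser). This is a sketch of how that one parser reads the file, not a promise about how every bot behaves. The path /index.html is made up, and linklooker is borrowed from the fifth example as a bot that has no record of its own.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: askjeeves
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Bots with their own record get in; everyone else falls under "*".
for bot in ["googlebot", "msnbot", "askjeeves", "linklooker"]:
    print(bot, parser.can_fetch(bot, "/index.html"))
```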
These are a few examples of how a simple robots.txt file is set up. There are a lot of complicated configurations out there, and even some sites that have put the User-agent and Disallow statements in the wrong order.
There are a lot of good tutorials on robots.txt and you can do a search on Google to find them. Here is a link to TheBigCrawl – checking robots.txt files for errors.
If you took the time to create a robots.txt file, take the time to check it.
Hope this is of some help.
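If you would rather do a quick check yourself, here is a minimal Python sketch that looks for the mistakes mentioned in this tutorial: a Disallow before any User-agent, a path that does not start with "/", and a comment sharing a line with a rule. The function lint_robots is my own invention and only knows the two fields covered here, so treat it as a rough first pass and not a full validator.

```python
def lint_robots(text):
    """Return (line_number, message) pairs for problems this tutorial warns about.

    lint_robots is a hypothetical helper; it only checks User-agent and Disallow.
    """
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            seen_agent = False  # a blank line ends the current record
            continue
        if line.startswith("#"):
            continue  # a comment on its own line is fine
        if "#" in line:
            problems.append((number, "comment shares a line with a rule"))
            line = line.split("#", 1)[0].strip()
        if ":" not in line:
            problems.append((number, "missing ':' between field and value"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_agent = True
        elif field == "disallow":
            if not seen_agent:
                problems.append((number, "Disallow appears before any User-agent"))
            if value and not value.startswith("/"):
                problems.append((number, "path should start with '/'"))
    return problems


sample = "Disallow: /a\nUser-agent: *\nDisallow: my_email.html\nDisallow: /images/ #pics\n"
for number, message in lint_robots(sample):
    print(number, message)
```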
Nils


