SEARCH MARKETING BLOG

SEO Speak: What is a Robots file?

Continuing my series on the demystification of SEO terms, I thought I’d cover robots files, another item we speak about frequently.

Put simply, a robots file lets you tell search engines not to crawl specific areas of your website, which in most cases keeps those areas out of the SERPs.

When a search engine visits your website, it reads the contents of your robots file first and then skips the sections you have disallowed in that file.

A robots file is a plain text document that sits in the root directory of your website (e.g. www.yourdomain.com/robots.txt). It can contain instructions for the whole of your website or for just one section, and those instructions can be aimed at all search engine robots or at specific ones.
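
As a quick illustration of where the file lives, here is a minimal Python sketch that works out the robots file address for any page on a site. The function name robots_url and the example page address are my own assumptions, used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.yourdomain.com/shop/scentedcandle.htm"))
```

Whatever page you start from, the result always points at the root of the domain, because that is the only place a crawler looks for the file.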

A Robots file should contain at least one robot instruction, which could be a blank one such as the below:
User-agent: *
Disallow:

Sitemap: http://www.yourdomain.com/sitemap.xml

Avoid adding a / after the Disallow command when nothing follows it. “Disallow: /” tells search engine robots not to crawl any section of your site, which means you won’t get any rankings in the SERPs.
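
To see how much difference that single slash makes, here is a small sketch using Python’s built-in urllib.robotparser module. The helper name is_allowed and the page address are assumptions for illustration, and Python’s parser is only one interpretation of the rules, so treat the result as a sanity check rather than a guarantee of how any given search engine behaves.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, agent="*"):
    """Parse robots file lines and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

# An empty Disallow blocks nothing; "Disallow: /" blocks everything.
allow_all = ["User-agent: *", "Disallow:"]
block_all = ["User-agent: *", "Disallow: /"]

page = "http://www.yourdomain.com/scentedcandle.htm"
print(is_allowed(allow_all, page))
print(is_allowed(block_all, page))
```

The first check comes back as allowed and the second as blocked, even though the two files differ by one character.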

In many cases there are files on a server that you would not want crawled, such as the CGI folder or your shopping cart admin section. The example below blocks anything whose path starts with /cgi.

User-agent: *
Disallow: /cgi

Sitemap: http://www.yourdomain.com/sitemap.xml
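
A Disallow value is matched against the start of the URL path, so it can catch more than the folder you had in mind. The sketch below checks a few paths against the /cgi rule with Python’s urllib.robotparser; the three paths are hypothetical examples of my own, not pages from a real site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /cgi",
    "",
    "Sitemap: http://www.yourdomain.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

# Hypothetical paths: a script in the CGI folder, a page whose name
# happens to start with "cgi", and an ordinary page.
paths = ["/cgi-bin/form.pl", "/cgi-guide.htm", "/index.htm"]
results = {p: parser.can_fetch("*", "http://www.yourdomain.com" + p) for p in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Here /cgi-guide.htm is blocked along with the CGI folder, because its path also begins with /cgi. If you only want the folder, a more specific value such as the full folder path with a trailing slash is safer.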

You can use a robots file to keep a single page away from search engines if it contains information you do not wish the crawlers to see, for example a duplicate of another page: www.yourdomain.com/product23.htm might be the same as www.yourdomain.com/scentedcandle.htm. “product23” is not a very SEO-friendly URL but “scentedcandle” is, so it would be good to remove the duplication by blocking the product23 URL in the robots file. In this case you would add the following to your robots file:

User-agent: *
Disallow: /product23.htm

Sitemap: http://www.yourdomain.com/sitemap.xml
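
When blocking a single page, the Disallow value has to match the address the page is actually served at, file extension included. The following sketch, again using Python’s urllib.robotparser, compares a rule ending in .htm with one ending in .html; the helper name blocked_paths is an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(disallow_path, paths):
    """Return the paths that a single Disallow rule would block."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_path])
    base = "http://www.yourdomain.com"
    return [p for p in paths if not parser.can_fetch("*", base + p)]

pages = ["/product23.htm", "/scentedcandle.htm"]

# The rule matching the real address blocks the duplicate page only.
print(blocked_paths("/product23.htm", pages))
# A rule with the wrong extension blocks nothing at all.
print(blocked_paths("/product23.html", pages))
```

A one-letter slip in the extension leaves the duplicate page open to crawlers, so it is worth testing a new rule against the exact URL before you rely on it.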

It is also possible to block a whole group of URLs, for example all of the product-code URLs above. You would do this with a catch-all wildcard, which blocks any URL containing the word “product”:
User-agent: *
Disallow: /*product

Sitemap: http://www.yourdomain.com/sitemap.xml

When using wildcards like the one above, you have to be careful that the Disallow value does not block pages on your site that you want the crawler to see. For example, if you had a page at www.yourdomain.com/productsandservices, it would be disallowed too.
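
One way to catch this kind of accident is to test a wildcard against a list of your own URLs before publishing it. Python’s built-in robots parser does not understand wildcards, so the sketch below uses a small hand-written matcher instead. It assumes the common convention where * matches any run of characters and a trailing $ marks the end of the URL, and it writes the rule with a leading slash as /*product; the function names and sample paths are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Turn a robots-style wildcard pattern into a compiled regex."""
    # Escape everything, then turn the escaped * back into "match anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    # A trailing $ anchors the match to the end of the path.
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

def is_blocked(pattern, path):
    """Report whether path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None

rule = "/*product"
for path in ["/product23.htm", "/productsandservices", "/scentedcandle.htm"]:
    print(path, "blocked" if is_blocked(rule, path) else "allowed")
```

Running a check like this shows /productsandservices being blocked alongside the product-code pages, which is exactly the side effect to watch for.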

If you want to give a specific robot an instruction, for example to keep a page out of its index, then you can call the user-agent (or robot) by name.

User-agent: * relates to all engines

User-agent: googlebot relates to Google’s robot

User-agent: msnbot relates to MSN’s robot

User-agent: slurp relates to Yahoo’s robot
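
To show how a named robot can be given its own instruction, here is a short sketch with Python’s urllib.robotparser. The robots file content in it is an assumed example of my own: it blocks one page for googlebot and leaves everything open for every other robot.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: one group for googlebot, one for everyone else.
ROBOTS_LINES = [
    "User-agent: googlebot",
    "Disallow: /product23.htm",
    "",
    "User-agent: *",
    "Disallow:",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

url = "http://www.yourdomain.com/product23.htm"
for agent in ["googlebot", "msnbot", "slurp"]:
    print(agent, "allowed" if parser.can_fetch(agent, url) else "blocked")
```

Only googlebot is blocked from the page; the other two robots fall back to the * group. Note that each group keeps its User-agent and Disallow lines together, with blank lines used only between groups.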

It is good practice to include a reference to your sitemap.xml in the robots file, as in the above examples, so that your robots file points to the pages you want viewed as well as those you do not.
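
If you want to confirm that the sitemap reference can be read back out of the file, Python’s urllib.robotparser offers a site_maps() method (available from Python 3.8). This is a minimal sketch using the same example file as earlier in the post.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /cgi",
    "",
    "Sitemap: http://www.yourdomain.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line sits outside the User-agent groups, so it applies no matter which robot is reading the file.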

This entry was posted in SEO Blog by Emily Mace.

About Emily Mace

Emily joined Vertical Leap as an SEO Campaign Delivery Manager in 2008, having gained wide search marketing experience as a web developer, SEO specialist and trainer for local Government departments and Tourism South East. Emily gained Google Analytics Individual Qualification in 2011, and regularly blogs on the technical aspects of SEO, sharing her expertise with our readers.