Three Ways to Keep a Web Page out of the Search Engines
Wed, 17 Oct 2007 12:34:24 by Kerry Dye

Controlling "link juice" (a more generic term than Google's PageRank) has been a hot topic in the Vertical Leap office recently. It therefore seems a good time to write a blog about the different ways this can be done, when you might want to use them and why.

Method 1: robots.txt

This is a file that you place in the root of your site. When a spider or robot comes to visit your site, this is the first file that it requests. Using the information there, it looks at your site, skipping any pages that you don't want it to visit.

For the purposes of this blog, I am limiting my mention of robots.txt to SEO uses, but it can also be used for directing other robots, such as content scrapers. For search engine optimisation, we use robots.txt quite specifically. It is the best way to prevent pages from appearing in the index. Listing them in robots.txt under a Disallow directive will stop a search engine from looking at a page and putting it in the results for searches. If a page that you don't actually want appearing is already in the results, then adding it to robots.txt will mean that it is removed, although the reaction isn't instant.

So why would you want to keep pages out of the search engines? Well, in SEO the primary reason is duplicate content. Site owners and creators often create duplicate content without realising it. Common routes for this are printable versions of pages or accessible versions of pages. With exactly the same content on each, it is difficult for a search engine to work out which is the most important version. This can result in the devaluation of all of the versions, which impacts on your website's visibility overall.

The secondary reason for employing robots.txt is low value pages. Again, this is normally something done unwittingly as part of the site build process. Recent examples I've seen causing this are mostly forms. Whilst from a usability point of view, pre-populating a form with a product or the result of a search is great for the user, for a search engine it creates potentially hundreds of similar pages with only a single element that differs.
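As a rough sketch of how a well-behaved robot reads Disallow lines, the Python below feeds a small robots.txt to the standard library's urllib.robotparser and asks whether some URLs may be visited. The robots.txt content, the /print/ and /enquiry-form paths and the example.com URLs are all made up for illustration; a real site would list its own printable pages and forms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths are illustrative, not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /enquiry-form
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a robot obeying robots_txt may request url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/products/widget.html",
        "http://www.example.com/print/widget.html",
        "http://www.example.com/enquiry-form?product=widget",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Disallow matches on the start of the path, so a single line such as /enquiry-form covers every pre-populated variation of that form. This only models how a compliant robot interprets the file; it says nothing about how quickly any search engine reacts to it.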

Method 2: rel="nofollow" on links

This is where you put rel="nofollow" inside the <a> tag of the relevant link. It is often used to control spamming, e.g. by Wikipedia, by discouraging people from posting links just for the sake of gaining link juice. On some sites it is used on all external links for this reason. From an SEO point of view it can be used to avoid leakage, that is, to cut down the off-page links that can lower the importance rating of a page.
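To make the markup concrete, here is a minimal Python sketch that scans a fragment of HTML and reports which links carry rel="nofollow". The sample HTML, the example.com and example.org addresses and the LinkCollector and collect_links names are hypothetical and exist only for this illustration; it shows what the markup looks like, not how any particular search engine treats it.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: one ordinary link and one nofollow link.
SAMPLE_HTML = """
<p>See <a href="http://www.example.com/partner">our partner</a> and
<a href="http://www.example.org/comment-link" rel="nofollow">a user-submitted link</a>.</p>
"""


class LinkCollector(HTMLParser):
    """Collects (href, is_nofollow) pairs for every <a> tag with an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. "external nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def collect_links(html: str):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for href, is_nofollow in collect_links(SAMPLE_HTML):
        print(href, "nofollow" if is_nofollow else "followed")
```

A check like this can be handy for auditing which outbound links on a page have been marked.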

Also, from the point of view of improving search engine rankings, this tag can be used for directing page importance. Matt Cutts of Google advocates the use of the tag for this reason. Read Joe's blog for information about using this on internal links to control link juice and the pros and cons.

The disadvantage of the nofollow tag compared to robots.txt is that it is inconsistently applied by the search engines. Whilst it causes Google and MSN/Live to ignore the link completely, Yahoo does follow the link whilst discounting its value, and Ask ignores the tag entirely as it is unsupported. This has been evaluated experimentally several times, and it means the nofollow tag is not entirely useful for controlling duplicate content and low value pages. To absolutely ensure something isn't listed, you need to use robots.txt. For controlling the value of spam and external links, it sort of works: the fact that many would-be spammers believe that all search engines function the same way as Google and MSN means it does discourage bad linking practices, although it also devalues what could be valid links when applied systematically.

Method 3: noindex metatag

Primarily the noindex metatag is used as an alternative to robots.txt, especially where you do not have access to the root of the site to change the content of robots.txt, such as in a hosted environment.
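For reference, the metatag goes in the page's <head> and takes the form <meta name="robots" content="noindex">. The Python sketch below checks whether a page carries it; the sample page and the RobotsMetaParser and has_noindex names are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical printable page that asks robots not to index it.
SAMPLE_PAGE = """<html><head>
<title>Printable version</title>
<meta name="robots" content="noindex">
</head><body><p>Same content as the main page.</p></body></html>"""


class RobotsMetaParser(HTMLParser):
    """Gathers the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        # content may hold a comma-separated list, e.g. "noindex, follow".
        content = attr_map.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


if __name__ == "__main__":
    print(has_noindex(SAMPLE_PAGE))
```

Because the tag lives in the page itself, it can be added through a template or content management system without touching the root of the site.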

Whilst this sounds good, unfortunately, just like nofollow, it appears to be handled inconsistently across the search engines. Peculiarly, this time it's MSN and Yahoo that show the page, although slightly differently. Unlike nofollow, which is newsworthy because of the spam issue, noindex has almost no other blogs focussing on it apart from this relatively unscientific one from Matt, so it is impossible to say with complete confidence that this is 100% right. So if you have access to it, robots.txt is the best method to use.

So there you have my take on what methods are available. I will follow this blog soon with some examples of where Vertical Leap have used these to get better results for our clients.



Kerry Dye
Campaign Delivery Manager

