
Three Ways to Keep a Web Page out of the Search Engines
Wed, 17 Oct 2007 by Kerry Dye


Controlling “link juice” (a more generic term than Google’s PageRank) has been a hot topic in the Vertical Leap office recently. It therefore seems a good time to write a blog about the different ways this can be done, when you might want to use them and why.

Method 1:  robots.txt

This is a file that you place in the root of your site. When a spider or robot visits your site, this is the first file it accesses. Using the information there, it crawls your site, excluding any pages that you don’t want it to visit.

For the purposes of this blog, I am limiting my mention of robots.txt to SEO uses, but it can also be used for directing other robots, such as content scrapers. For search engine optimisation, we use robots.txt quite specifically. It is the best way to prevent pages from appearing in the index. Listing them in robots.txt with the Disallow directive will stop a search engine from looking at a page and putting it in the results for searches. If a page that you don’t want appearing is already in the results, adding it to robots.txt will mean that it is removed, although the reaction isn’t instant.

So why would you want to keep pages out of the search engines? Well, in SEO the primary reason is duplicate content. Often the site owner or creator is creating duplicate content without realising it. Common routes for this to happen are printable versions of pages or accessible versions of pages. With exactly the same content, it is difficult for a search engine to work out which is the most important version, and this can result in the devaluation of all of the versions, which affects your website’s overall visibility.

The secondary reason for using robots.txt is low-value pages. Again, this is normally something done unwittingly as part of the site build process. Recent examples I’ve seen causing this are mostly forms. Whilst a form pre-populated with a product or the result of a search is great for the user from a usability point of view, for a search engine it creates potentially hundreds of similar pages with only a single element that differs.
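As a rough sketch of how Disallow rules behave, the snippet below feeds a small robots.txt to the parser in Python's standard library and asks whether a crawler may fetch a given path. The /print/ and /enquiry/ paths and the is_allowed helper are made up for illustration; they stand in for the printable versions and pre-populated forms mentioned above and are not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The paths are assumptions for this example:
# /print/ stands in for printable duplicates, /enquiry/ for pre-populated forms.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /enquiry/
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if the rules above let `user_agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/products/widget.html", "/print/widget.html", "/enquiry/?product=42"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Rules match by path prefix, so one Disallow line for a directory covers every page beneath it, which is usually simpler than listing duplicate pages one by one.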

Method 2: rel="nofollow" on links

This is where you put the rel="nofollow" attribute (often loosely called the nofollow tag) into the <a> tag of the relevant link. It is often used to control spamming, e.g. by Wikipedia, by discouraging people from posting links just for the sake of gaining link juice. On some sites it is used for all external links for this reason. From an SEO point of view it can be used to avoid leakage: reducing the number of off-page links, which can lower the importance rating of a page.
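To make the markup concrete, here is a minimal sketch that scans a fragment of HTML and reports which links carry nofollow. The sample HTML, the example.com URL and the LinkAuditor and audit_links names are all assumptions made for illustration, and the parsing uses only Python's standard library.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: one ordinary internal link, one nofollowed external link.
SAMPLE_HTML = """
<p>See <a href="/services/">our services</a> or this
<a href="http://example.com/" rel="nofollow">external site</a>.</p>
"""


class LinkAuditor(HTMLParser):
    """Collects (href, is_nofollow) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. rel="external nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def audit_links(html):
    auditor = LinkAuditor()
    auditor.feed(html)
    return auditor.links


if __name__ == "__main__":
    for href, nofollow in audit_links(SAMPLE_HTML):
        print(href, "nofollow" if nofollow else "followed")
```

A check like this can be handy for confirming that a template applies nofollow only to the links you intended.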

Also, from the point of view of improving search engine rankings, this tag can be used for directing page importance. Matt Cutts of Google advocates the use of the tag for this reason. Read Joe’s blog for information about using this on internal links to control link juice and the pros and cons.

The disadvantage of the nofollow tag compared to robots.txt is that it is applied inconsistently by the search engines. Whilst it causes Google and MSN/Live to ignore the link completely, Yahoo does follow the link whilst discounting its value, and Ask ignores the tag entirely as it is unsupported. This has been evaluated experimentally several times, and it means the nofollow tag is not entirely useful for controlling duplicate content and low-value pages. To absolutely ensure something isn’t listed, you need to use robots.txt. For controlling the value of spam and external links, it sort of works. Many would-be spammers believe that all search engines function the same way as Google and MSN, so it does discourage bad linking practices, although it also devalues what could be valid links when applied systematically.

Method 3: noindex meta tag

Primarily this tag is used as an alternative to robots.txt, especially where you do not have access to the root of the site to change the content of robots.txt, such as in a hosted environment.
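For illustration, the tag sits in the page's head as a robots meta element. The sketch below checks a page for it using Python's standard library; the sample page and the RobotsMetaFinder and has_noindex names are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical printable page that asks search engines not to index it.
SAMPLE_PAGE = (
    "<html><head><title>Printable version</title>"
    '<meta name="robots" content="noindex">'
    "</head><body>...</body></html>"
)


class RobotsMetaFinder(HTMLParser):
    """Gathers the directives from any <meta name="robots"> tag on the page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex, follow".
        content = attr_map.get("content") or ""
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())


def has_noindex(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


if __name__ == "__main__":
    print(has_noindex(SAMPLE_PAGE))
```

Note that a spider has to be able to fetch the page to see this tag at all, so it would not normally be combined with a robots.txt rule blocking the same page.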

Whilst this sounds good, unfortunately, just like nofollow, it appears to be handled inconsistently across the search engines. Peculiarly, this time it’s MSN and Yahoo that show the page, although slightly differently. Unlike nofollow, which is newsworthy because of the spam issue, noindex has almost no other blogs focussing on it apart from this relatively unscientific one from Matt, so these findings can’t be stated with complete confidence. So if you have access to it, robots.txt is the best method to use.

So those are the methods available. I will follow this blog soon with some examples of where Vertical Leap have used these to get better results for our clients.
