Three Ways to Keep a Web Page out of the Search Engines
Wed, 17 Oct 2007 by Kerry Dye

Controlling “link juice” (using a more generic term than Google’s PageRank) has been a hot topic in the Vertical Leap office recently. It therefore seems a good time to write a blog about the different ways this can be done, when you might want to use them and why.

Method 1: robots.txt

This is a file that you place in the root of your site. When a spider/robot comes to visit your site, this is the first file that it accesses. Using the information here, it looks at your site, excluding any pages that you don’t want it to look at.

For the purposes of this blog, I am limiting my mention of robots.txt to SEO uses, but it can also be used for directing other robots, such as content scrapers.

For search engine optimisation, we use robots.txt quite specifically. It is the best way to prevent pages from appearing in the index. Listing a page in robots.txt with the Disallow command will stop a search engine from looking at it and putting it in the results for searches. If a page that you don’t want appearing is already in the results, then adding it to robots.txt will mean that it is removed, although the reaction isn’t instant.

So why would you want to keep pages out of the search engines? Well, in SEO the primary reason is duplicate content. Often the site owner or creator is creating duplicate content without knowing it. There are several common routes for this.

The secondary reason for employing robots.txt is low value pages. Again, this is normally something done unwittingly as part of the site build process. Recent examples I’ve seen causing this are mostly forms. Whilst from a usability point of view a form pre-populated with a product or the result of a search is great for the user, for a search engine it creates potentially hundreds of similar pages with only a single element that differs.

Method 2: rel="nofollow" on links

This is where you put this attribute into the <a> tag of the relevant link.
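As a rough illustration of where the attribute sits, the sketch below embeds two made-up links (the example.com URLs are placeholders, not taken from any real site) and uses Python's standard html.parser module to separate the links carrying rel="nofollow" from the ordinary ones. It only shows the markup being read back; it is not a description of how any search engine treats the attribute.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the first link is an ordinary link,
# the second carries rel="nofollow".
SAMPLE_HTML = """
<a href="http://www.example.com/products/">Our products</a>
<a href="http://www.example.com/login/" rel="nofollow">Log in</a>
"""


class LinkCollector(HTMLParser):
    """Sorts the href of each <a> tag by whether its rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. rel="external nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def classify_links(html):
    """Return (followed, nofollowed) lists of href values found in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.followed, collector.nofollowed


followed, nofollowed = classify_links(SAMPLE_HTML)
print("followed:  ", followed)
print("nofollowed:", nofollowed)
```

Running a check like this over your own templates is one way to confirm that the attribute has ended up on the links you intended and nowhere else.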
It is often used to control spamming, e.g. by Wikipedia, by discouraging people from posting links just for the sake of gaining link juice. On some sites it is used for all external links for this reason. From an SEO point of view it can be used to avoid leakage: reducing the number of off-page links, which can lower the importance rating of a page.

Also, from the point of view of improving search engine rankings, this tag can be used for directing page importance. Matt Cutts of Google advocates the use of the tag for this reason. Read Joe’s blog for information about using this on internal links to control link juice and the pros and cons.

The disadvantage of the nofollow tag compared to robots.txt is that it is inconsistently applied by the search engines. Whilst it causes Google and MSN/Live to ignore the link completely, Yahoo does follow the link whilst discounting its value, and Ask ignores the tag entirely as it is unsupported. This has been evaluated several times experimentally, and means the nofollow tag is not entirely useful for controlling duplicate content and low value pages. To absolutely ensure something doesn’t list, you need to use robots.txt.

For controlling the value of spam and external links, it sort of works. Certainly the fact that many would-be spammers believe that all search engines function the same way as Google and MSN means it does discourage bad linking practices, although it also devalues what could be valid links when applied systematically.
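To make the robots.txt fallback mentioned above more concrete, here is a minimal sketch of a file with two Disallow rules, checked with Python's standard urllib.robotparser module. The paths (/print/ and /enquiry-form/) and the example.com URLs are invented for illustration, standing in for a duplicate-content section and a pre-populated form. The parser shows how Python reads the rules, which may differ in detail from how a given search engine reads them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /enquiry-form/
"""


def build_parser(robots_text):
    """Parse robots.txt text held in a string rather than fetched from a site."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for url in (
    "http://www.example.com/products/widget.html",
    "http://www.example.com/print/widget.html",
    "http://www.example.com/enquiry-form/?product=widget",
):
    verdict = "allowed" if is_allowed(parser, url) else "disallowed"
    print(url, "->", verdict)
```

Because a Disallow rule matches by path prefix, the single /enquiry-form/ line covers every pre-populated variant of the form, whatever its query string.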
Method 3: noindex metatag

Primarily this tag is used as an alternative to robots.txt, especially where you do not have access to the root of the site to change the content of robots.txt, such as in a hosted environment. Whilst this sounds good, unfortunately, just like nofollow, it appears to be handled inconsistently across the search engines. Peculiarly, this time it’s MSN and Yahoo that show the page, although slightly differently. Unlike nofollow, which is newsworthy because of the spam issue, noindex has almost no other blogs focussing on it apart from this relatively unscientific one from Matt, so it can’t be said with complete confidence that this is 100% right. So if you have access to it, robots.txt is the best method to use.

So there you have it: those are the methods available. I will follow this blog soon with some examples of where Vertical Leap have used these to get better results for our clients.
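For reference, the noindex metatag from Method 3 sits in the <head> of the page itself. The sketch below uses a made-up page (the title and body are placeholders) and Python's standard html.parser module to check whether a robots metatag containing noindex is present. It is a simple way to audit your own pages, not a model of how any search engine responds to the tag.

```python
from html.parser import HTMLParser

# Hypothetical page carrying the noindex metatag in its <head>.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Enquiry form</title>
    <meta name="robots" content="noindex">
  </head>
  <body><p>Form goes here.</p></body>
</html>
"""


class RobotsMetaFinder(HTMLParser):
    """Collects the directives listed in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # content may list several comma-separated values, e.g. "noindex, nofollow".
        content = attributes.get("content") or ""
        for part in content.split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """Return True if html contains a robots metatag that includes noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return "noindex" in finder.directives


print("noindex present:", has_noindex(SAMPLE_PAGE))
```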