Three Ways to keep a Web Page out of the Search Engines
17th October 2007 by Kerry Dye
In the Vertical Leap office, controlling “link juice” (to use a more generic term than Google’s PageRank) has been a hot topic recently. It therefore seems a good time to write a blog post about the different ways this can be done, when you might want to use them and why.
Method 1: robots.txt
This is a file that you place in the root of your site. When a spider/robot comes to visit your site, this is the first file that it accesses. Using the information here, it looks at your site, excluding any pages that you don’t want it to visit.
For the purposes of this blog, I am limiting my mention of robots.txt to SEO uses, but it can also be used for directing other robots, such as content scrapers. For search engine optimisation, we use robots.txt quite specifically. It is the best way to prevent pages from appearing in the index. Listing them in robots.txt with the Disallow directive will stop a search engine from looking at a page and putting it in the results for searches. If a page that you don’t want appearing is already in the results, then adding it to robots.txt will mean that it is removed, although the reaction isn’t instant.
So why would you want to keep pages out of the search engines? Well, in SEO the primary reason is duplicate content. Often the site owner or creator is unwittingly creating duplicate content. Common routes for this to happen are printable versions of pages or accessible versions of pages. With exactly the same content, it is difficult for a search engine to work out which is the most important version, and this can result in the devaluation of all of the versions, which impacts on your website’s visibility overall.
The secondary reason for using robots.txt is low-value pages. Again, this is normally something done unwittingly as part of the site build process. Recent examples I’ve seen causing this are mostly forms. Whilst a form pre-populated with a product or the result of a search is great for the user, for a search engine it creates potentially hundreds of similar pages with only a single element that differs.
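As a rough illustration of the Disallow rules described in this section, here is a minimal sketch that checks a robots.txt against a few URLs using Python's standard `urllib.robotparser`. The `/print/` and `/enquiry-form/` paths and the example.com URLs are made up for the example, and the parser only models a well-behaved robot; it says nothing about how a particular search engine treats its index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The /print/ and /enquiry-form/ paths
# are invented stand-ins for duplicate and low-value pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /enquiry-form/
"""

# Hypothetical URLs on an example site.
URLS = [
    "http://www.example.com/products/widget.html",
    "http://www.example.com/print/widget.html",
    "http://www.example.com/enquiry-form/?product=widget",
]


def allowed_urls(robots_txt, urls, agent="*"):
    """Return the URLs that a robot obeying robots_txt may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    for url in allowed_urls(ROBOTS_TXT, URLS):
        print("crawlable:", url)
```

Each Disallow line is a path prefix, so a single rule covers every printable page or every pre-populated form underneath that path.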
Method 2: rel="nofollow" on links
This is where you add the rel="nofollow" attribute to the <a> tag of the relevant link. It is often used to control spamming, e.g. by Wikipedia, by discouraging people from posting links just for the sake of gaining link juice. On some sites it is used for all external links for this reason. From an SEO point of view it can be used to avoid leakage, that is, to reduce the number of off-page links, which can lower the importance rating of a page.
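To make the markup concrete, the sketch below holds a small, made-up HTML fragment with one ordinary link and one nofollow link, and uses Python's standard `html.parser` to report which links carry the attribute. The `LinkAudit` class, the `audit_links` helper and the sample URLs are hypothetical names chosen for illustration, not part of any SEO tool.

```python
from html.parser import HTMLParser

# Made-up page fragment: the second link carries rel="nofollow".
SAMPLE = """
<p>See <a href="/products/">our products</a> or this
<a href="http://www.example.org/" rel="nofollow">user-submitted link</a>.</p>
"""


class LinkAudit(HTMLParser):
    """Collect (href, is_nofollow) for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. "external nofollow".
        rel_values = (attr.get("rel") or "").lower().split()
        self.links.append((attr.get("href"), "nofollow" in rel_values))


def audit_links(html):
    """Return a list of (href, is_nofollow) tuples found in html."""
    parser = LinkAudit()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, nofollow in audit_links(SAMPLE):
        print(href, "nofollow" if nofollow else "followed")
```

A check like this could be run over a template to confirm that nofollow has been applied to the links you intended and not to others.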
Also, from the point of view of improving search engine rankings, this tag can be used for directing page importance. Matt Cutts of Google advocates the use of the tag for this reason. Read Joe’s blog for information about using this on internal links to control link juice and the pros and cons.
The disadvantage of the nofollow tag compared to robots.txt is that it is inconsistently applied by the search engines. Whilst it causes Google and MSN/Live to ignore the link completely, Yahoo does follow the link whilst discounting the value, and Ask ignores the tag entirely as it is unsupported. This has been evaluated several times experimentally, and means the nofollow tag is not entirely useful for controlling duplicate content and low-value pages. To absolutely ensure something doesn’t list, you need to use robots.txt. For controlling the value of spam and external links, it sort of works; certainly the fact that many would-be spammers believe that all search engines function the same way as Google and MSN means it does discourage bad linking practices, although it also devalues what could be valid links when applied systematically.
Method 3: noindex meta tag
Primarily this tag is used as an alternative to robots.txt, especially where you do not have access to the root of the site to change the content of robots.txt, such as a hosted environment.
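For reference, the tag goes in the page's head as a robots meta element. The sketch below embeds a made-up page containing one and uses Python's standard `html.parser` to detect it; the `RobotsMeta` class and `is_noindex` helper are hypothetical names for this example, and the check only confirms that the tag is present, not how any given search engine will respond to it.

```python
from html.parser import HTMLParser

# Made-up page: the robots meta tag asks engines not to index it.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Printable version</title>
    <meta name="robots" content="noindex, follow">
  </head>
  <body><p>Same content as the main page.</p></body>
</html>
"""


class RobotsMeta(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attr.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def is_noindex(html):
    """Return True if html carries a robots meta tag with noindex."""
    parser = RobotsMeta()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    print("noindex present:", is_noindex(SAMPLE_PAGE))
```

Because the tag lives in each page rather than in a file at the site root, it can be added from a page template even when robots.txt is out of reach.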
Whilst this sounds good, unfortunately, just like nofollow, this appears to be handled inconsistently across the search engines. Peculiarly, this time it’s MSN and Yahoo that show the page, although slightly differently. Unlike nofollow, which is newsworthy because of the spam issue, almost no blogs focus on noindex apart from this relatively unscientific one from Matt, so it is not possible to say with complete confidence that this is 100% right. So if you have access to it, robots.txt is the best method to use.
So there you have the methods that are available. I will follow this blog soon with some examples of where Vertical Leap have used these to get better results for our clients.