17 November 2009

Robots.txt SEO Techniques

This post is a long but important one. I recommend you grab a cup of hot chocolate before you start :)

If you have not heard of the robots.txt file, it is simply a small text file located in your website’s root directory that tells search engine bots what they can and can’t crawl. Nothing enforces these rules, but search engine bots will generally respect them. With a properly configured robots.txt file you can, for example, attempt to fend off spam bots, tell Google not to index your images, or instruct bots to skip pages that may contain duplicate content.

Bots are pieces of software used by search engine companies, spammers and content aggregators to crawl the internet for new or modified content. A bot’s job is to follow links, crawling from page to page and site to site. It’s kind of like a Six Degrees of Kevin Bacon thing: follow enough links and you should eventually find all the content on the net. This is why backlinks are so important. The more backlinks you have, the easier it is for search engines to find your content. There are literally millions of bot instances trawling the net at any one time. Each bot identifies itself with a user-agent name, of which there are thousands. Let’s take Google for example. Google has many different user-agents used to index your site, extract images and videos, find news feeds, find mobile phone content, check your site for AdSense quality and so on. This site details a complete list of known user-agents.
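The link-following idea above can be sketched as a breadth-first traversal. This is only a toy model, not how any real search engine bot is built: the `LINKS` dictionary is a made-up link graph standing in for real pages, and `crawl` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "site-a/home": ["site-a/about", "site-b/home"],
    "site-a/about": ["site-a/home"],
    "site-b/home": ["site-b/post"],
    "site-b/post": [],
    "site-c/orphan": [],  # nothing links here, so a crawler cannot reach it
}

def crawl(start):
    """Return pages in the order a breadth-first crawler discovers them."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Follow every outgoing link that has not been visited yet.
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

discovered = crawl("site-a/home")
print(discovered)
```

The page without any backlinks never shows up in `discovered`, which is the point about backlinks made above.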

The robots.txt file has been around for ages. The standard behind it was proposed by Martijn Koster in 1994, and it remains a staple food for web spiders. For a complete description of the file and its standard notation, visit here. In short, a robots.txt file can restrict specific bots from crawling your entire site or part thereof. To make this possible, every bot has its own signature. For example, Google’s index bot is called Googlebot, Bing’s bot is called MSNbot, and Yahoo’s bot is called Yahoo! Slurp (it answers to the token Slurp in robots.txt).

An entry in the robots.txt file may look like this:

User-agent: Slurp
Allow: /public*/
Disallow: /*_print*.html

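Here we are telling the Slurp user agent that it can access all pages located in any directory starting with “public”, but not .html pages with “_print” in the URI. Note that the * wildcard is an extension supported by the major search engines rather than part of the original standard, so not every bot will honor it.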
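To get a feel for how such wildcard patterns behave, here is a minimal Python sketch. It assumes the wildcard semantics documented by the major search engines (`*` matches any run of characters, a trailing `$` anchors the end of the path, and a pattern otherwise matches as a prefix). `pattern_to_regex` and `matches` are hypothetical helper names, the sample paths are made up, and individual bots may interpret patterns differently.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, and everything else is a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern, path):
    """Return True if the URL path falls under the given pattern."""
    return pattern_to_regex(pattern).search(path) is not None

print(matches("/public*/", "/public_html/index.html"))
print(matches("/public*/", "/private/index.html"))
print(matches("/*_print*.html", "/articles/post_print_v2.html"))
print(matches("/*_print*.html", "/articles/post.html"))
```

The sketch only tests one pattern at a time; how a bot resolves a conflict between an Allow and a Disallow that both match is a separate question and is not modeled here.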

Below is a complete robots.txt file for one of my experimental WordPress sites (I’ll post an article explaining what I mean by experimental site another day). Astute readers may note that I am disallowing all user agents from specific directories, while giving a handful of named Google user agents access to the whole site; a bot that finds a group addressed to it by name follows that group instead of the catch-all * rules. The file also lists the location of my sitemap, a widely supported extension to the original standard that helps search engines find all of my pages.

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /search/*/feed
Disallow: /search/*/*

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

Sitemap: http://beginnerchess.org/sitemap.xml
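One way to sanity-check a file like this before uploading it is Python’s standard `urllib.robotparser` module. The sketch below feeds it a shortened copy of the rules above; the test URLs and the `SomeOtherBot` name are made up for illustration. This parser does plain prefix matching and does not understand `*` wildcards inside paths, so the `/search/*/...` rules are left out, and `site_maps()` assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the file above (wildcard rules omitted on purpose).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content

User-agent: Googlebot-Image
Allow: /

Sitemap: http://beginnerchess.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot with no group of its own falls back to the '*' rules.
generic_admin = parser.can_fetch("SomeOtherBot", "/wp-admin/options.php")
generic_post = parser.can_fetch("SomeOtherBot", "/2009/11/some-post/")
# A bot with its own group follows that group instead of the '*' rules.
image_upload = parser.can_fetch("Googlebot-Image", "/wp-content/uploads/board.png")
# Sitemap lines are collected separately from the crawl rules.
sitemaps = parser.site_maps()

print(generic_admin, generic_post, image_upload, sitemaps)
```

A check like this only tells you how one particular parser reads the file; real bots may differ, especially around wildcards.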

Keeping bots away from content that is not meant for search visitors can help your site stay keyword-focused across the pages that do get crawled, which in turn may help its search engine rankings. Say, for example, you have worked hard at optimizing all your pages for the keyword “weight gain” and its various long tails. That work may be diluted in the eyes of the search engine if it is also able to crawl your login page, privacy page and contact form.

Some SEO experts also argue that Google penalizes young websites in favor of older, more established sites. The claim is that Google uses the Internet Archive (found here) to determine the age of a site and, if it cannot find the site in the archive, falls back on an assumed age. For this reason, many people actively stop the Internet Archive user-agent from crawling their site. This can be done by including the following lines:

User-agent: ia_archiver
Disallow: /

You may also want to stop image bots from accessing your pictures if you have borrowed non-stock images from other sites. This can be done like so:

User-agent: Googlebot-Image
Disallow: /

Finally, robots.txt can be used to keep bots away from specific pages whose content is also available on other sites or pages. It is often argued that Google will punish your rankings for displaying duplicate content. I personally do not see this as a big issue and believe that duplicate content can actually help your site’s ranking in some instances (more about this another day). Anyway, to stop a bot from accessing a specific page, add the following lines:

User-agent: *
Disallow: /*my-duplicate-page.html

Note that this is not a fool-proof method. If other sites link to your disallowed page, search engines can still list its URL in their results, even though well-behaved bots will not crawl its content.

I could keep going, but I’m sure you are all bored by now. Feel free to comment below or contact me directly if you wish to know more.

Happy roboting.

