Getting to Know Your Bots: Robots.txt 101

by Rhea Drysdale on 04/23/2012 • 11 Comments | SEO

It’s Monday and that means it’s time to get fired up about the week. What better way to kick things off than with a quick conversation about robots?!

Yes, ROBOTS!

Not our friendly bot in the picture though, I’m talking about your robots.txt file. Wait, don’t panic! It’s ok if you don’t know what the heck I’m talking about. We’ll start slow and by the end of this post you will know exactly what a robots.txt file is. I want you to be able to throw down in some serious web geek conversations. Why would you want to do that? Because the robots.txt file controls how search engines and other bots see your site. You probably spend a LOT of time carefully crafting your brand and content, but what’s the point if search engines aren’t viewing your content or are seeing too much of it? Let’s fix that.

Your robots.txt is a text file that is housed at [yourdomain].com/robots.txt. Go ahead and check to see if your site has a robots.txt. Go for it, I’ll wait!
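If you’d rather let a script do the checking, here is a minimal sketch in Python (standard library only). The function names robots_url and has_robots_txt are my own inventions for illustration, and example.com is just a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url: str) -> str:
    """Build the robots.txt address for whatever site page_url belongs to."""
    parts = urlsplit(page_url)
    # robots.txt always lives at the root of the host, never in a subfolder.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(page_url: str) -> bool:
    """Return True if the site serves a real robots.txt (needs network access)."""
    try:
        with urlopen(robots_url(page_url), timeout=10) as response:
            # A redirect to some other page means the file is not really there.
            final_path = urlsplit(response.geturl()).path
            return response.status == 200 and final_path == "/robots.txt"
    except OSError:
        # HTTP errors (such as a 404), connection failures, and timeouts
        # are all OSError subclasses; treat every one of them as "not found".
        return False


print(robots_url("https://example.com/blog/some-post?page=2"))
# prints: https://example.com/robots.txt
```

Calling has_robots_txt("https://example.com/") would then tell you whether that (placeholder) site actually answers at the robots.txt address.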

Did you find a file there? If you did, it probably looked like a short list of plain-text lines, most of them starting with User-agent: or Disallow:.

If you found a 404 page, were redirected to another page of your site, or simply got an error message, then your robots.txt doesn’t appear to exist or is set up incorrectly. If that’s the case, I still want you to see what a robots.txt file looks like, so go scope out a robots.txt on another site. I love visiting the robots.txt files on other sites (yes, I’m a freak).

As you can see, robots.txt files come in all shapes and sizes! Why is that? Every site is constructed differently, from the content management system to the server to the code to the shopping cart. There are a ton of variables that can impact the way your site is built and how it appears.

Why does this matter? It matters to web robots. Weeee, robots!

Search engines only know about your site because they are able to (or not able to) access it. While visitors can click freely from page to page of your site (assuming there aren’t logins or other secure areas they have to pass through first), robots *should* follow instructions you give them about what to access and how often. The robots.txt is this set of instructions.

If you’re more technical, bear with me while I give an overly simplified breakdown of robots and the robots.txt file. I want our readers to walk before they run!

There are all kinds of robots online, some good and some not so good. This may sound a little like Terminator (good vs evil bots) and it is, just without the hostile takeover of humanity (yet). Every search engine has robots that are used to crawl web sites, which is how search engines like Google or Bing are able to crawl content and return search results. Without these robots, they wouldn’t have content to deliver to a searcher. There are “bad” robots that crawl sites for more nefarious purposes or they simply crawl without any care in the world for crawl rates. This disrespect can quickly eat up your bandwidth if not managed well, but is usually not a major problem for smaller sites.

Google’s most famous robot is called Googlebot, and Google’s own webmaster help documentation explains it in more detail.

Bing’s robot is called… wait for it… bingbot. Duane Forrester gives us the skinny on bingbot (video in Silverlight, because they’re Microsoft!):

There’s a lot to know about robots.txt files, so I don’t want to overload your brain too much. Let’s just cover the most common robots.txt commands. Now you’ll be able to both read and eventually write your own files without having to bribe the IT team.

How to Use Your Robots.txt

User-agent:

This specifies the type of robot that you want to command. Your robots.txt file will start with this command 99% of the time (the only exception will be if you see a comment or a sitemap: listed above it). Not sure which robot you want to give instructions to? That’s ok. Leave it set to “all” as the default. You do this by writing the following, where “*” means “all robots”:

User-agent: *
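To see how a robot decides which group of rules applies to it, here’s a small sketch using Python’s built-in urllib.robotparser module. The rules, the /junk/ and /private/ folders, and the helper name load_rules are made up for illustration, and real crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /junk/

User-agent: bingbot
Disallow: /private/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(RULES)

# Googlebot has no group of its own, so it falls back to the "*" group.
print(rules.can_fetch("Googlebot", "/junk/old.html"))     # False
# bingbot has its own group, which replaces (not extends) the "*" group.
print(rules.can_fetch("bingbot", "/junk/old.html"))       # True
print(rules.can_fetch("bingbot", "/private/notes.html"))  # False
```

The takeaway: a robot follows the most specific group that names it, so anything you want every robot to obey needs to be repeated in each named group.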

Allow:

This is exactly what it sounds like. You’re telling robots that you want to allow certain content. You do NOT need to tell robots every page or folder of your site that you want to allow. By default, robots will freely crawl your site. The only instance when you might want to use this is if you’re worried they might not otherwise access a particular page or folder because of another command. For example, if you want to disallow a certain folder, you might want to tell the robots that you still want them to be able to access a certain page in that folder.
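Here’s a rough sketch of that “allow one page inside a disallowed folder” case, again using Python’s built-in urllib.robotparser. The /downloads/ folder and catalog.pdf file are hypothetical. One assumption to flag loudly: Python’s parser applies the first matching rule in file order, which is why the Allow line comes first here. Google describes its own behavior as picking the most specific (longest) matching rule, so other robots may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /downloads/catalog.pdf
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The one page we explicitly allowed is still crawlable...
print(parser.can_fetch("Googlebot", "/downloads/catalog.pdf"))   # True
# ...while everything else in the folder stays off limits.
print(parser.can_fetch("Googlebot", "/downloads/internal.zip"))  # False
# Pages outside the folder were never restricted in the first place.
print(parser.can_fetch("Googlebot", "/about.html"))              # True
```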

Disallow:

The disallow does the opposite of allow. You’re telling the robots that you don’t want them to access a certain page or folder. They may still know the URL exists (for example, from links pointing to it), but they won’t crawl its content. This is probably the most common command for the robots.txt file because it’s how we tell a robot that we don’t want them to look at a junk folder or a set of URLs that shouldn’t exist, but for some technical reason do today. Note: try to always correct the problem at the source rather than using the robots.txt as a band-aid.
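One detail that trips people up is that a Disallow path is matched from the start of the URL path, so a missing trailing slash can block more than you meant. A minimal sketch, assuming Python’s built-in urllib.robotparser and some invented paths (is_blocked is a hypothetical helper, not part of any library):

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow_path: str, url_path: str) -> bool:
    """Check url_path against a one-rule robots.txt that disallows disallow_path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("Googlebot", url_path)


# With the trailing slash, only the /junk/ folder is off limits.
print(is_blocked("/junk/", "/junk/old-page.html"))      # True
print(is_blocked("/junk/", "/junk-removal-tips.html"))  # False
# Without it, anything whose path merely starts with /junk is blocked too.
print(is_blocked("/junk", "/junk-removal-tips.html"))   # True
```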

Noindex:

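The noindex tells robots that you want content removed from their search results. It’s useful in situations where you have old content that is returning an error message, or folders and areas of your site that you do not want to appear to searchers even though you still want to give your users access. Be aware that Noindex: was never an official part of the robots.txt standard, and Google stopped honoring it in robots.txt in September 2019. The supported way to keep a page out of the index is a noindex robots meta tag (or an X-Robots-Tag HTTP header) on the page itself, and the page has to stay crawlable so robots can see that instruction.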

For example, many web developers will build a development version of your new site or redesign on their own domain. If they don’t noindex the folder where your site appears, suddenly there is a duplicate version of your site appearing on their domain!

NO!!! Do not let this happen.

Make sure the folder where your dev site appears has been both disallowed and noindexed. This means you do not want the search engines to see the content OR index it. Make sense?
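It’s worth knowing that Noindex: is not part of the robots.txt standard, so a parser that only follows the standard simply skips the line. The sketch below illustrates that with Python’s built-in urllib.robotparser. The /dev-site/ folder is a made-up example, and this only shows how one parser behaves, not what any particular search engine does.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_text: str, url_path: str) -> bool:
    """Ask Python's standard parser whether url_path may be crawled."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch("Googlebot", url_path)


NOINDEX_ONLY = """\
User-agent: *
Noindex: /dev-site/
"""

DISALLOW_AND_NOINDEX = """\
User-agent: *
Disallow: /dev-site/
Noindex: /dev-site/
"""

# The Noindex line is unknown to the parser, so on its own it blocks nothing.
print(can_crawl(NOINDEX_ONLY, "/dev-site/index.html"))          # True
# Only the Disallow line actually stops the crawl.
print(can_crawl(DISALLOW_AND_NOINDEX, "/dev-site/index.html"))  # False
```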

Sitemap:

This is a command that tells the search engines where to access your XML sitemap file. If you’ve submitted your sitemap through their webmaster tool areas then it’s less vital that you have this command, but I always like to include it, so the bots have no question about where to find my file.
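For the curious, here’s a short sketch of how a robot might pick the sitemap location out of the file. It assumes Python 3.8 or newer, where urllib.robotparser gained the site_maps() method, and the example.com sitemap URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /junk/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of every Sitemap: URL, or None if there are none.
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Note that the Sitemap: line stands on its own. It isn’t tied to any User-agent group, so it applies no matter which robot is reading the file.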

Robots.txt warning:
Folder structure matters! Using the same example as the developer working on your redesign, let’s pretend they need to noindex and disallow your site, which is located two folders deep at: [theirdomain].com/portfolio/[yoursite]/

They cannot simply specify /[yoursite]/ as the folder to noindex and disallow. I see this happen often! Someone wants to remove a certain folder, so (using the example above), they’ll write the following:

Disallow: /[yoursite]/
Noindex: /[yoursite]/

Guess what? That isn’t good enough! What you told the robots to do is to disallow and noindex a folder in the ROOT of the domain called /[yoursite]/, but your site isn’t located there. This is absolutely not the same thing as /portfolio/[yoursite]/. To ensure that you’re disallowing and noindexing the right folder, you would have to do one of the following:

Option 1:

Disallow: /portfolio/[yoursite]/
Noindex: /portfolio/[yoursite]/

Option 2:

Disallow: /*/[yoursite]/
Noindex: /*/[yoursite]/

Remember when we told the user-agent that we wanted all robots to listen up? How did we do that? We used the asterisk (*). This is the same thing. The asterisk in option two is a placeholder for any folder name. It means that the robots will disallow and noindex any folder named [yoursite] that appears AFTER an initial folder. Now they will be able to find your folder and know not to crawl or index it. Without stating the name of the first folder or giving a catch-all with the asterisk, you aren’t giving the robots sufficient instructions to follow.
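To make the folder-structure point concrete, here’s a rough sketch of how a path rule gets compared against a URL. Python’s built-in robots.txt parser doesn’t understand wildcards inside paths, so this uses a small hand-written matcher instead. The function rule_matches is hypothetical, “clientsite” stands in for [yoursite], and the matching logic is my simplified reading of how the major search engines treat the asterisk, not their actual code.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern covers the given URL path.

    Rules match from the START of the path, and "*" stands for any run
    of characters (a simplified model of search engine wildcard handling).
    """
    # Escape the literal pieces, then join them with ".*" where each "*" was.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(regex, path) is not None


dev_page = "/portfolio/clientsite/index.html"

# The common mistake: this only covers a folder in the root of the domain.
print(rule_matches("/clientsite/", dev_page))            # False
# Option 1: spell out the full folder path.
print(rule_matches("/portfolio/clientsite/", dev_page))  # True
# Option 2: let the asterisk stand in for the first folder.
print(rule_matches("/*/clientsite/", dev_page))          # True
```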

It’d be pretty difficult to drive somewhere with step three missing in a five step set of instructions, huh? Same thing!

FINAL WARNING: Be careful. Always validate your robots.txt file and check it before you upload or modify the file. There are horror stories in the vault of SEO lore about sites disappearing from the search results because of a penalty when the reality is they messed up a command in the robots.txt. This is a powerful little file. There is nothing more powerful than this on your site besides the usually hidden .htaccess file (we’ll save this one for another time).
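As one example of the kind of sanity check worth running before you upload, the sketch below flags a file that would shut robots out of the entire site. It relies on Python’s built-in urllib.robotparser, and blocks_whole_site is a made-up helper name. It is not a full validator, just a guard against the single most painful mistake.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_text: str, user_agent: str = "Googlebot") -> bool:
    """Return True if this robots.txt would keep user_agent off the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(user_agent, "/")


# A single stray slash blocks everything.
print(blocks_whole_site("User-agent: *\nDisallow: /"))       # True
# An empty Disallow means "nothing is disallowed".
print(blocks_whole_site("User-agent: *\nDisallow:"))         # False
# Blocking one folder leaves the rest of the site crawlable.
print(blocks_whole_site("User-agent: *\nDisallow: /junk/"))  # False
```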

We’ve only begun to scratch the surface of your robots.txt file, so if you’re loving this post, here’s even more to read from authorities far better versed in this than myself:

Robots.txt Resources

  • http://www.robotstxt.org/
  • http://sebastians-pamphlets.com/
  • https://www.google.com/webmasters/tools/
    Click into a domain and go to “Site Configuration” then “Crawler Access.”

    Within Google Webmaster Central, when you click into a site, you can view blocked URLs, construct your robots.txt or remove URLs for that particular domain. You can also test how Google is able to access your site with a few of their robots: Googlebot-Mobile (crawls pages for mobile index), Googlebot-Image (crawls pages for image index), Mediapartners-Google (crawls pages to determine AdSense content), and Adsbot-Google (crawls pages to measure AdWords landing page quality).

Robots.txt Generators

  • http://tools.seobook.com/robots-txt/generator/
    The robots.txt file creation tool works well, but users are limited to the engines in the list and there is no download function of the file after creation.

    SEOBook also has a robots.txt analyzer, but it doesn’t appear to work well, nor are the error reports very clear: http://tools.seobook.com/robots-txt/analyzer/

  • http://www.mcanerin.com/EN/search-engine/robots-txt.asp
    This tool works, but users are limited to only five directories for exclusion and are limited to the engines in the list. There are no download capabilities for the robots.txt file and no apparent way to validate or check the file.

Have any questions? Let’s chat below in the comments. And, make it an awesome Monday!


About the Author

Rhea Drysdale

Rhea Drysdale is the Chief Executive Officer of Outspoken Media. When she isn't fighting for the SEO industry, she's She-Ra on Twitter. Connect with Rhea on Google.


11 thoughts on “Getting to Know Your Bots: Robots.txt 101”

  1. Note – if you already have URLs in the index that you don’t want there, do not NOINDEX them AND block them in robots.txt. Because if you block them in robots.txt, Google won’t see the NOINDEX.

  2. Thanks for the post. As netmeg pointed out, to prevent pages and folders being indexed, I’m much more a fan of the noindex meta tag. It’s also less of a tipoff to some pesky SEO looking for secret internal info. I love visiting a robots.txt file and seeing stuff the site wanted to hide – they’re pointing you right to it!

    I was actually doing some research and putting together some fun robots.txt files with cute comments in them, an idea I had after the whole SEER rankings fiasco. After checking their robots file, they’ve got a final line in there – it’s worth checking out – seerinteractive.com/robots.txt. Probably the best one I’ve seen is Rishi’s at explicitly.me/robots.txt. I love how these files themselves are indexed, and you need to find a workaround to prevent them from being indexed themselves.

    Anyway, thanks for the post – will definitely reference it when I write about fun robots.txt files.

    UPDATE FROM RHEA: Mark also shared this on Twitter: http://www.seomoz.org/robots.txt

    • Mark, great point about snoopers. Most 101 types wouldn’t be too worried about that, but absolutely when it comes to competitive fields and sites. LOVE SEER’s and Rishi’s robots, thanks for including them. And, it’s always hilarious to see a robots.txt file indexed. Like ours! Let’s fix that. :D

  3. Important note to add: Be aware that your robots.txt file is public for anyone to view. So don’t rely on robots.txt to hide sensitive information from view, because if you have a malicious user visiting your site, one of the first places he will look is your robots.txt file to see if there are any sensitive or vulnerable files listed there.

    For example, if he sees this:

    Disallow: /secret/important-hidden-secret-stuff.html
    Disallow: /secret/credit-card-numbers.txt

    …and you haven’t taken precautions in securing those files, then you may be in trouble.

  4. I was unaware of robots.txt until around 2 weeks ago when I found Google was indexing my test site! Thankfully, due to various research and this post, I’m now a robots guru!

  5. While some search providers actually read and respect the robots.txt file, _not all do_.

    A recent example I found was Bing.
    We found it was explicitly indexing items that were excluded. Not only that, but it was taking substantial bandwidth (over a period of time) sucking up disallowed content using multiple simultaneous ip connections.

    A check of the IP ranges used verified that they were indeed owned by Microsoft.
    We eventually blocked the ranges in use by Bing for abuse.

    More here –
    http://www.computersolutions.cn/blog/2012/05/msn-bing-crawler-spider-madness/

  6. Hello Rhea,

    Thanks for the great post. Can you please help me with the following? I have some dynamic pages. How should I write my robots.txt file so that search engines do not read these pages:

    http://www.abc.com/index.php?option=com_virtuemart&page=shop.pdf_output&showpage=shop.product_details&pop=1&output=pdf&product_id=34&category_id=1&pop=1&vmcchk=1&Itemid=2

    Is this fine:
    User-agent: *
    Noindex:/index.php/*? option=*

    Thanks for your help in advance…

    Regards
    Ashok
