Remove URLs using a robots.txt

It's likely that you have pages on your website that you don't want the search engines to index. These might be pages like your privacy policy, or pages that you don't want to be accessible to the public at all. In the first case, where the page (such as a privacy policy) is linked and accessible via your website, it can be blocked using a robots.txt file.

Creating a robots.txt

A robots.txt file is simply a standard text file which can be created with any text editor, such as Notepad, and saved with the .txt extension. Upload the robots.txt to the root of your website so search engines can find it at http://www.domain.com/robots.txt.
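
As a quick illustration, here is what a complete robots.txt might look like, with a comment line and a single group of rules (the folder and page names are just placeholders); the sections below explain each kind of rule in turn:

# Block the private folder and the print version of the homepage
User-agent: *
Disallow: /private-folder/
Disallow: /print-home.html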

Blocking bots from indexing using a robots.txt file

To deny all bots access to an entire website:

User-agent: *
Disallow: /

To deny all bots from indexing a specific page:

User-agent: *
Disallow: /page.html

To deny all bots from indexing a folder you can use the following:

User-agent: *
Disallow: /folder/

To deny all bots from indexing any URL containing 'monkey', use a wildcard:

User-agent: *
Disallow: /*monkey

To deny dynamic URLs which contain a '?', use the same method, again with a wildcard:

User-agent: *
Disallow: /*?

To specify which bot you want to block, change the User-agent. For example, to deny Googlebot:

User-agent: Googlebot
Disallow: /page.html
Disallow: /folder/
Disallow: /*monkey
Disallow: /*?

How to remove a page from Google that has already been indexed

Google supports the noindex directive, so if you specify a page using the Noindex directive within your robots.txt, you can then log in to Google Webmaster Tools, go to Site Configuration > Crawler Access > Remove URL, and request that the page be removed.

User-agent: Googlebot
Noindex: /page.html

When not to use a robots.txt

Just as you can access a web page through a browser, anyone can look at your robots.txt file, so it's important not to use a robots.txt to block a private page, or a page that hasn't even been linked to from your website (in that case the bot won't be able to find it anyway).

The other issue is that not all dynamic URLs have a pattern that allows them to be easily blocked by a robots.txt. In those cases you can deliver the robots directive another way, by setting an X-Robots-Tag header.

Using the X-Robots-Tag

Setting an X-Robots-Tag is the more discreet way of blocking a URL. You can test your page headers using this HTTP Request and Response Header Tool.

With PHP you can tell bots not to index the page, not to archive it, not to show a snippet, and not to follow the links on the page:

header("X-Robots-Tag: noindex, nofollow, noarchive, nosnippet", true);

Using a .htaccess file (with Apache's mod_headers module enabled) you can do the same using FilesMatch:

<FilesMatch "page\.html">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
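
If you'd rather check from code than with the header tool mentioned above, a minimal PHP sketch (the URL below is a placeholder, swap in the page you want to test) can fetch the response headers and print any X-Robots-Tag line:

<?php
// Request the page and read back its raw response header lines.
$headers = get_headers("http://www.domain.com/page.html");
if ($headers) {
    foreach ($headers as $header) {
        // Print only the X-Robots-Tag header, if one was sent.
        if (stripos($header, "X-Robots-Tag") === 0) {
            echo $header . "\n";
        }
    }
}
?>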