robots.txt and regular expressions?

  • Thread starter Tomasz Chmielewski


Tomasz Chmielewski

I'm not sure which group is the right one for questions about the
robots.txt file, so I'm asking here.

I would like to exclude robots from accessing links such as:


What should the robots.txt line be to exclude such pages (for bots which
understand regexps, like Googlebot, Yahoo Slurp, etc.)?

1) Disallow: /index.php*action=edit

2) Disallow: /index\.php.*action=edit

According to (and ), it should be option 2).

However, almost every "robots.txt regexp" search result seems to point
to option 1).

What is the correct answer?






The robots.txt standard permits *no* wildcards or regexes in pathnames.

Google has introduced their own extensions to the standard, but they're
not regular expressions. Other bots might respect Google's extensions,
or they might not. (Mine doesn't.)
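To illustrate the point above, Python's standard-library `urllib.robotparser` implements the original prefix-matching rules with no wildcard support, so the `*` in option 1) is treated as a literal asterisk. A small sketch (the example.com host and the MediaWiki-style URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Feed the rule from option 1) to Python's standard-library parser,
# which does plain prefix matching (no wildcards, no regexes).
rp = RobotFileParser()
rp.modified()  # mark the "file" as read so can_fetch() gives real answers
rp.parse([
    "User-agent: *",
    "Disallow: /index.php*action=edit",
])

# The '*' is taken literally, so a real edit URL is NOT blocked:
print(rp.can_fetch("*", "http://example.com/index.php?title=Foo&action=edit"))

# ...while a URL that literally contains an asterisk IS blocked,
# because it prefix-matches the rule character for character:
print(rp.can_fetch("*", "http://example.com/index.php*action=edit"))
```

In other words, against a parser that follows the original standard, neither option does what the question intends.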

It might be useful to keep in mind that robots.txt is a somewhat weak
standard. It's never been formalized the way that (for example) HTML and
HTTP have been formalized. There's no robots.txt RFC, for instance. The
standard is based on the rules described here:

A long time ago, there was an attempt to formalize the robots.txt
standard, but it was never completed. Here's the draft RFC if you would
like to read it:

But some wildcard matching is allowed, like

Disallow: /*.php$

(isn't it?), which blocks access to all of your .php files.
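For what it's worth, Google's documented extension supports only two metacharacters: `*` matches any sequence of characters and `$` anchors the end of the URL; it is not a regex engine. In that dialect (assuming a bot that honors the extension, and the MediaWiki-style URLs from the question), the edit-page rule would look like:

```
User-agent: Googlebot
# '*' here is Google's wildcard extension (matches any run of
# characters), not a regular expression metacharacter.
Disallow: /index.php*action=edit
```

Bots that only implement the original standard will treat that `*` as a literal character, so this rule should not be relied on as access control.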

