robots.txt and regular expressions?

  • Thread starter Tomasz Chmielewski

Tomasz Chmielewski

I'm not sure which group is the right one for questions about the
robots.txt file, so I'm asking here.

I would like to exclude robots from accessing such links:

/index.php?title=One_page&action=edit
/index.php?title=Other_page&action=edit

What robots.txt line would exclude such pages (for bots which
understand regexps, like Googlebot, Yahoo Slurp etc.)?

1) Disallow: /index.php*action=edit

2) Disallow: /index\.php.*action=edit


According to http://www.google.com/help/faq_codesearch.html#regexp (and
http://en.wikipedia.org/wiki/Regular_expression#Syntax), it should be
the 2) one.

However, almost every "robots.txt regexp" search result seems to point to
the 1) one.

What is the correct answer?
 
faerber.jan

The robots.txt standard permits *no* wildcards or regexes in pathnames.

Google has introduced their own extensions to the standard, but they're
not regular expressions. Other bots might respect Google's extensions,
or they might not. (Mine doesn't.)

It might be useful to keep in mind that robots.txt is a somewhat weak
standard. It's never been formalized the way that (for example) HTML and
HTTP have been formalized. There's no robots.txt RFC, for instance. The
standard is based on the rules described here: http://www.robotstxt.org/

A long time ago, there was an attempt to formalize the robots.txt
standard, but it was never completed. Here's the draft RFC if you would
like to read it: http://www.robotstxt.org/norobots-rfc.txt


But some regex-like patterns are allowed, aren't they? For example,

Disallow: /*.php$

blocks access to all your PHP files (for bots that support the extension).
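
To make the difference between these wildcard patterns and real regular expressions concrete, here is a minimal Python sketch. It assumes the extension works the way Google describes it: `*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is taken literally. The function name `google_style_match` is made up for illustration; this is not any crawler's actual code.

```python
import re


def google_style_match(pattern: str, url_path: str) -> bool:
    """Return True if url_path is covered by a wildcard-style Disallow pattern.

    Assumed semantics (not taken from any crawler's source):
    '*' matches any sequence of characters, a trailing '$' anchors the
    end of the URL, and every other character is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, so an unanchored pattern acts as a prefix.
    return re.match(regex, url_path, re.DOTALL) is not None


url = "/index.php?title=One_page&action=edit"

print(google_style_match("/index.php*action=edit", url))     # True
print(google_style_match(r"/index\.php.*action=edit", url))  # False
print(google_style_match("/*.php$", "/index.php"))           # True
print(google_style_match("/*.php$", url))                    # False
```

Under these assumptions, pattern 1) covers the edit links, while pattern 2) does not, because the backslash and the dot are treated as literal characters. Note also that `/*.php$` would not cover the edit links, since those URLs end in `action=edit` rather than `.php`.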

Jan
 
