robots.txt and regular expressions?

Tomasz Chmielewski · May 3, 2008

I'm not sure which group is the right one for questions about the
robots.txt file, so I'm asking here.

I would like to stop robots from accessing links such as:

/index.php?title=One_page&action=edit
/index.php?title=Other_page&action=edit

Which robots.txt line would exclude such pages (for bots that
understand regexps, like Googlebot, Yahoo Slurp etc.)?

1) Disallow: /index.php*action=edit

2) Disallow: /index\.php.*action=edit

According to http://www.google.com/help/faq_codesearch.html#regexp (and
http://en.wikipedia.org/wiki/Regular_expression#Syntax), it should be
option 2).

However, almost every "robots.txt regexp" search result seems to point to
option 1).

What is the correct answer?

faerber.jan · May 3, 2008

> What is the correct answer?

I guess:

/index.php?(.*?)&action=edit

Maybe these help:
http://blog.searchenginewatch.com/blog/060206-200854
http://erik.eae.net/playground/regexp/regexp.html
(I think you must check "global" there.)

Sometimes there are helpful people in the #php IRC channels.
I don't know of any PHP newsgroups offhand.

Greetings, Jan

faerber.jan · May 3, 2008

I guess:

/index.php(.*?)action=edit

http://erik.eae.net/playground/regexp/regexp.html
(I think you must check "global" there.)
http://blog.searchenginewatch.com/blog/060206-200854
http://sitemaps.blogspot.com/2006/02/more-stats-and-analysis-of-robotstxt.html

bye Jan

faerber.jan · May 4, 2008

> The robots.txt standard permits *no* wildcards or regexes in pathnames.
>
> Google has introduced their own extensions to the standard, but they're
> not regular expressions. Other bots might respect Google's extensions,
> or they might not. (Mine doesn't.)
>
> It might be useful to keep in mind that robots.txt is a somewhat weak
> standard. It's never been formalized the way that (for example) HTML and
> HTTP have been formalized. There's no robots.txt RFC, for instance. The
> standard is based on the rules described here: http://www.robotstxt.org/
>
> A long time ago, there was an attempt to formalize the robots.txt
> standard, but it was never completed. Here's the draft RFC if you would
> like to read it: http://www.robotstxt.org/norobots-rfc.txt
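
To illustrate the point about the original standard, here is a minimal Python sketch of how a bot that knows no wildcards might evaluate a Disallow value: as a literal prefix of the URL path. The function name prefix_disallowed is made up for this example, and the sketch ignores details such as %-encoding and User-agent grouping.

```python
def prefix_disallowed(disallow_paths, url_path):
    """Return True if url_path starts with any non-empty Disallow value.

    Assumption: literal, character-by-character prefix comparison,
    with no special meaning for '*', '$' or regex syntax.
    """
    return any(p and url_path.startswith(p) for p in disallow_paths)


edit_url = "/index.php?title=One_page&action=edit"
view_url = "/index.php?title=One_page"

# A wildcard line is taken literally, so it does not match the edit link.
print(prefix_disallowed(["/index.php*action=edit"], edit_url))  # False

# A plain prefix does match, but it blocks every /index.php URL, not only edits.
print(prefix_disallowed(["/index.php"], edit_url))  # True
print(prefix_disallowed(["/index.php"], view_url))  # True
```

Under this reading, neither 1) nor 2) from the original question would exclude the edit links for a bot that follows only the original rules.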

But some pattern matching is allowed, like

Disallow: /*.php$

(isn't it?)

which blocks access to all URLs that end in .php.
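
A rough way to check patterns like this is to emulate the wildcard extension in Python. The sketch below assumes the semantics Google documents for its crawler: '*' matches any sequence of characters, a trailing '$' anchors the end of the URL, and every other character (including '.', '\' and '?') is literal. The helper names wildcard_to_regex and is_blocked are made up for this example; other bots may behave differently.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard-style robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each piece so it is matched literally, then join pieces with '.*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, url_path):
    # re.match anchors at the start of the path, like a Disallow prefix.
    return wildcard_to_regex(pattern).match(url_path) is not None


edit_url = "/index.php?title=One_page&action=edit"

print(is_blocked("/index.php*action=edit", edit_url))     # True  (option 1)
print(is_blocked(r"/index\.php.*action=edit", edit_url))  # False (option 2)
print(is_blocked("/*.php$", "/index.php"))                # True
print(is_blocked("/*.php$", edit_url))                    # False
```

Under these assumptions, option 1) from the original question blocks the edit links, option 2) does not because its backslash and dot are treated as literal characters, and /*.php$ does not cover URLs that carry a query string after .php.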

Jan
