Python 2.7 re.IGNORECASE broken in re.sub?

Christopher

I have the following problem:

Python 2.7 (r27:82525, Jul 4 2010, 07:43:08) [MSC v.1500 64 bit
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
---

Perhaps this is intended behavior, but it seems like the last two
results should be the same, not the first two. In other words, the
call to re.sub with re.IGNORECASE on should return "Python27" not
"Python26".

This appears to be the case when using compiled pattern matching:
'Python27'
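
The exact session is not reproduced above, so this is only a minimal sketch of why the compiled form might behave differently, assuming a hypothetical pattern `python\d\d` and input `Python26`. The second positional argument of re.compile() is flags, so re.IGNORECASE lands where a flag is expected.

```python
import re

# Hypothetical pattern and input, chosen only for illustration.
pattern = re.compile(r"python\d\d", re.IGNORECASE)  # second argument is flags

# The compiled pattern carries the flag, so the match ignores case.
compiled_result = pattern.sub("Python27", "Python26")
print(compiled_result)
```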
 
Steven D'Aprano

I have the following problem:


Is this a known bug? Is it by design for some odd reason?

Help on function sub in module re:

sub(pattern, repl, string, count=0)
...


You're passing re.IGNORECASE (which happens to equal 2) as a count
argument, not as a flag. Try this instead:
'Python27'
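
A minimal sketch of the difference, with a hypothetical pattern and input standing in for the original session. The fourth parameter of re.sub() is count, so a flag passed there only limits the number of replacements, while a leading (?i) in the pattern turns on case-insensitive matching.

```python
import re

text = "Python26"        # hypothetical input
pattern = r"python\d\d"  # hypothetical pattern

# re.IGNORECASE has the integer value 2, so as a count it only means
# "replace at most two matches"; matching stays case-sensitive.
as_count = re.sub(pattern, "Python27", text, count=int(re.IGNORECASE))

# An inline (?i) at the start of the pattern makes the match case-insensitive.
inline_flag = re.sub("(?i)" + pattern, "Python27", text)

print(as_count)     # input comes back unchanged
print(inline_flag)  # input is replaced
```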
 
Alex Willmer

You're passing re.IGNORECASE (which happens to equal 2) as a count
argument, not as a flag. Try this instead:

'Python27'

Basically right, but in-line flags must be placed at the start of a
pattern, or the result is undefined. Also in Python 2.7 re.sub() has a
flags argument.

Python 2.7.0+ (release27-maint:83286, Aug 16 2010, 01:25:58)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
'Python27'
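
As a rough sketch of the flags argument (the pattern and input here are hypothetical, not the ones from the session above): passing flags by keyword keeps it from being mistaken for count, and the two can be combined.

```python
import re

text = "python25, PYTHON26, Python26"  # hypothetical input

# flags is passed by keyword, so it cannot be taken for count.
all_replaced = re.sub(r"python\d\d", "Python27", text, flags=re.IGNORECASE)

# count and flags together: only the first match is replaced.
first_only = re.sub(r"python\d\d", "Python27", text,
                    count=1, flags=re.IGNORECASE)

print(all_replaced)
print(first_only)
```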

Alex
 
MRAB

Alex said:
Basically right, but in-line flags must be placed at the start of a
pattern, or the result is undefined. Also in Python 2.7 re.sub() has a
flags argument.
[snip]
In re such flags apply to the entire regex, no matter where they appear.
This even applies to the (?x) (VERBOSE) flag; if re sees it at the end
of the regex then it has to re-scan the entire regex!

For clarity and compatibility with other regex implementations, put the
flag at the start of the pattern.
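
A small sketch of that advice, using a hypothetical pattern body and input. The leading form is the portable one. Whether a trailing flag is accepted depends on the interpreter: Python 2.7 took it, while some later Python releases reject a global inline flag that is not at the start, so the sketch guards that call rather than assuming either outcome.

```python
import re
import warnings

body = r"python\d\d"  # hypothetical pattern body
text = "PYTHON26"     # hypothetical input

# Flag first: the portable form.
leading = re.sub("(?i)" + body, "Python27", text)

# Flag last: accepted by some interpreter versions, rejected by others.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # some versions only warn
        trailing = re.sub(body + "(?i)", "Python27", text)
except re.error:
    trailing = None  # this interpreter refuses the trailing flag

print(leading)
print(trailing)
```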
 
Steven D'Aprano

Basically right, but in-line flags must be placed at the start of a
pattern, or the result is undefined.

Pardon me, but that's clearly not correct, as proven by the fact that the
above example works.

You can say that the flags *should* go at the start, for the sake of
efficiency, or ease of comprehension, or tradition, or to appease the
Regex Cops who roam the streets beating up those who don't write regexes
in the approved fashion. But it isn't true that they *must* go at the
front.
 
Christopher

Help on function sub in module re:

    sub(pattern, repl, string, count=0)
    ...

You're passing re.IGNORECASE (which happens to equal 2) as a count
argument, not as a flag. Try this instead:


'Python27'

Thanks. Somehow I didn't notice that other argument after looking at
it a million times. :)
 
Alex Willmer

Pardon me, but that's clearly not correct, as proven by the fact that the
above example works.

Undefined includes 'might work sometimes'. I refer you to the Python
documentation:

"Note that the (?x) flag changes how the expression is parsed. It
should be used first in the expression string, or after one or more
whitespace characters. If there are non-whitespace characters before
the flag, the results are undefined."
http://docs.python.org/library/re.html#regular-expression-syntax
 
Alex Willmer

"Note that the (?x) flag changes how the expression is parsed. It
should be used first in the expression string, or after one or more
whitespace characters. If there are non-whitespace characters before
the flag, the results are undefined."
http://docs.python.org/library/re.html#regular-expression-syntax

Hmm, I found a lot of instances that place (?iLmsux) after non-
whitespace characters

http://google.com/codesearch?hl=en&lr=&q=file:\.py[w]?$+[^[:space:]"']+\(\?[iLmsux]+\)

including two from the Python unit tests, re_test.py lines 109-110.
Perhaps the documentation is overly cautious.
 
Steven D'Aprano

Undefined includes 'might work sometimes'. I refer you to the Python
documentation:

"Note that the (?x) flag changes how the expression is parsed. It should
be used first in the expression string, or after one or more whitespace
characters. If there are non-whitespace characters before the flag, the
results are undefined."
http://docs.python.org/library/re.html#regular-expression-syntax


Well so it does. I stand corrected.

I note though that even the docs say "should" rather than "must". I
wonder whether the documentation author is just being cautious, because
I've seen comments on the python-dev list that imply that the current
behaviour of flags (that their effect is global to the regex) is
supported. E.g.:

http://code.activestate.com/lists/python-dev/98681/

At the point that people are seriously considering changing the behaviour
of a replacement re engine in order to support the current "undefined"
behaviour, perhaps that behaviour isn't quite so undefined and the docs
need to be re-written?
 
