Regex anomaly

mike.klaas

Hello,

Has anyone had issues with compiled re's vis-a-vis the re.I (ignore
case) flag? I can't make sense of this compiled re producing a
different match when given the flag, odd both in its difference from
the uncompiled regex (as I thought the uncompiled API was a wrapper
around a compile-and-execute block) and its difference from the
compiled version with no flag specified. The match given is utter
nonsense given the input re.

In [48]: import re
In [49]: reStr = r"([a-z]+)://"
In [51]: against = "http://www.hello.com"
In [53]: re.match(reStr, against).groups()
Out[53]: ('http',)
In [54]: re.match(reStr, against, re.I).groups()
Out[54]: ('http',)
In [55]: reCompiled = re.compile(reStr)
In [56]: reCompiled.match(against).groups()
Out[56]: ('http',)
In [57]: reCompiled.match(against, re.I).groups()
Out[57]: ('tp',)

cheers,
-Mike
 
Roy Smith

LOL, and you'll be LOL too when you see the problem :)

You can't give the re.I flag to reCompiled.match(). You have to give
it to re.compile(). The second argument to reCompiled.match() is the
position where to start searching. I'm guessing re.I is defined as 2,
which explains the match you got.
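
A minimal sketch to check that guess, reusing the pattern and input from the original post. The variable names are mine; the only assumption is that re.I converts to a plain integer, which it does on the CPython versions I know of.

```python
import re

pattern = re.compile(r"([a-z]+)://")
against = "http://www.hello.com"

# re.I is an integer-valued flag; int() exposes the underlying value.
flag_value = int(re.I)

# The second positional argument of a compiled pattern's match() is `pos`,
# so passing re.I asks the match to start at that index instead of 0.
with_flag = pattern.match(against, re.I).groups()
with_pos = pattern.match(against, 2).groups()

print(flag_value, with_flag, with_pos)
```

Both calls start matching at the third character, which is why "ht" goes missing from the group.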

This is actually one of those places where duck typing let us down.
If we had type bondage, re.I would be an instance of RegExFlags, and
reCompiled.match() would have thrown a TypeError when the second
argument wasn't an integer. I'm not saying type bondage is inherently
better than duck typing, just that it has its benefits at times.
 
Andrew Durdin

The module-level re.compile and re.match functions take a flags parameter:

compile( pattern[, flags])
match( pattern, string[, flags])

But the match method of a compiled regular expression object takes different parameters:

match( string[, pos[, endpos]])

It's not a little confusing that the parameters to re.match() and
re.compile().match() are so different, but that's the cause of what
you're seeing.

You need to do:

reCompiled = re.compile(reStr, re.I)
reCompiled.match(against).groups()

to get the behaviour you want.
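
For anyone who wants to run it, here is a small self-contained sketch of the same fix. It uses an upper-case scheme (my choice, not from the original post) so that the flag actually changes the result; the variable names are illustrative.

```python
import re

re_str = r"([a-z]+)://"
against = "HTTP://www.hello.com"  # upper-case scheme, so case matters

# Module-level function: flags are accepted as the third argument.
module_level = re.match(re_str, against, re.I)

# Compiled pattern: flags go to re.compile(), not to .match().
compiled = re.compile(re_str, re.I).match(against)

# Without the flag, the lower-case character class cannot match "HTTP".
no_flag = re.compile(re_str).match(against)

print(module_level.groups(), compiled.groups(), no_flag)
```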

Andrew
 
Ganesan Rajagopal

mike klaas said:
In [48]: import re
In [49]: reStr = r"([a-z]+)://"
In [51]: against = "http://www.hello.com"
In [53]: re.match(reStr, against).groups()
Out[53]: ('http',)
In [54]: re.match(reStr, against, re.I).groups()
Out[54]: ('http',)
In [55]: reCompiled = re.compile(reStr)
In [56]: reCompiled.match(against).groups()
Out[56]: ('http',)
In [57]: reCompiled.match(against, re.I).groups()
Out[57]: ('tp',)

I can reproduce this on Debian Linux testing, with both Python 2.3 and
Python 2.4. Seems like a bug. search() also exhibits the same behavior.

Ganesan
 
mike.klaas

Thanks guys, that is probably the most ridiculous mistake I've made in
years <g>

-Mike
 
Roy Smith

mike.klaas said:
Thanks guys, that is probably the most ridiculous mistake I've made in
years <g>

If that's the most ridiculous mistake you can come up with, you're not
trying hard enough. I've done much worse.
 
Ganesan Rajagopal

mike klaas said:
Thanks guys, that is probably the most ridiculous mistake I've made in
years <g>

I was taken in too :). This is quite embarrassing, considering that I remember
reading a big thread on the python-dev list about this a while back!

Ganesan
 
Sam Pointon

Would this particular inconsistency be a candidate for change in Py3k?
Seems to me the pos and endpos arguments are redundant with slicing,
and the re.match function would benefit from having the same arguments
as pattern.match. Of course, this is a backwards-incompatible change;
that's why I suggested Py3k.
 
Andrew Durdin

Sam Pointon said:
Would this particular inconsistency be a candidate for change in Py3k?
Seems to me the pos and endpos arguments are redundant with slicing,

Being able to specify the start and end indices for a search is
important when working with very large strings (multimegabyte) --
where slicing would create a copy, specifying pos and endpos allows
for memory-efficient searching in limited areas of a string.
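
A short sketch of the difference, using the URL from earlier in the thread (the pattern and variable names here are illustrative). It also shows one case where pos and slicing are not interchangeable: '^' still refers to the real start of the string when pos is given.

```python
import re

word = re.compile(r"[a-z]+")
text = "http://www.hello.com"

# Match starting at index 7 without copying the string.
in_place = word.match(text, 7)
# Same text matched on a slice, which builds a new string first.
on_slice = word.match(text[7:])

# endpos makes the pattern behave as if the string ended at that index.
bounded = word.match(text, 7, 9)

# '^' anchors to the real start of the string, not to pos.
anchored = re.compile(r"^www")
anchored_pos = anchored.search(text, 7)      # no match
anchored_slice = anchored.search(text[7:])   # matches

print(in_place.span(), on_slice.span(), bounded.group())
```

The reported spans also differ: with pos they are offsets into the original string, with a slice they are offsets into the copy.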

Sam Pointon said:
and the re.match function would benefit from having the same arguments
as pattern.match.

Not at all; the flags need to be specified when the regex is compiled,
as they affect the compiled representation (finite state automaton I
expect) of the regex. If the flags were given in pattern.match(), then
there'd be no performance benefit gained from precompiling the regex.
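
One way to see that the flags belong to the compiled object is to look at its flags attribute. This is only a sketch; it says nothing about how the matcher is represented internally, and the variable names are mine.

```python
import re

re_str = r"([a-z]+)://"

plain = re.compile(re_str)
ignore_case = re.compile(re_str, re.I)

# Each compiled pattern records the flags it was built with.
plain_has_i = bool(plain.flags & re.I)
ignore_case_has_i = bool(ignore_case.flags & re.I)

# Same source pattern, different compiled objects, different behaviour.
plain_result = plain.match("HTTP://x")
ignore_case_result = ignore_case.match("HTTP://x")

print(plain_has_i, ignore_case_has_i, plain_result, ignore_case_result.group(1))
```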

Andrew
 
Roy Smith

Sam Pointon said:
Would this particular inconsistency be a candidate for change in Py3k?
Seems to me the pos and endpos arguments are redundant with slicing,
and the re.match function would benefit from having the same arguments
as pattern.match. Of course, this is a backwards-incompatible change;
that's why I suggested Py3k.

I don't see any way to implement re.I at match time; it's something that
needs to get done at regex compile time. It's available in the
module-level match() call, because that one is really compile-then-match().
 
Ron Garret

Roy Smith said:
I don't see any way to implement re.I at match time;

It's easy: just compile two machines, one with re.I and one without and
package them as if they were one. Then use the flag to pick a compiled
machine at run time.
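
A rough sketch of that idea, assuming a hypothetical DualPattern wrapper (not part of the re module) that only handles the ignore-case choice:

```python
import re


class DualPattern:
    """Hypothetical wrapper: keeps a case-sensitive and a case-insensitive
    compilation of the same pattern and picks one per call."""

    def __init__(self, pattern):
        self._sensitive = re.compile(pattern)
        self._insensitive = re.compile(pattern, re.I)

    def match(self, string, ignore_case=False):
        # The flag only selects a precompiled pattern; nothing is
        # recompiled at match time.
        chosen = self._insensitive if ignore_case else self._sensitive
        return chosen.match(string)


dual = DualPattern(r"([a-z]+)://")
strict = dual.match("HTTP://www.hello.com")
relaxed = dual.match("HTTP://www.hello.com", ignore_case=True)

print(strict, relaxed.groups())
```

The cost is two compilations up front, and supporting every combination of flags this way would multiply the number of compiled patterns accordingly.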

rg
 
Bryan Olson

Roy said:
LOL, and you'll be LOL too when you see the problem :)

You can't give the re.I flag to reCompiled.match(). You have to give
it to re.compile(). The second argument to reCompiled.match() is the
position where to start searching. I'm guessing re.I is defined as 2,
which explains the match you got.

This is actually one of those places where duck typing let us down.
If we had type bondage, re.I would be an instance of RegExFlags, and
reCompiled.match() would have thrown a TypeError when the second
argument wasn't an integer. I'm not saying type bondage is inherently
better than duck typing, just that it has its benefits at times.


Even with duck-typing, we could cut our users a break. Making
our flags instances of a distinct class doesn't actually require
type bondage.

We could define the __or__ method for RegExFlags, but really,
or-ing together integer flags is an old habit from low-level
languages. Really we should pass a set of flags.
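
A sketch of what that could look like, using a hypothetical RegExFlags class built on enum.Flag. The class and its members are invented for illustration; the real re flags are integer-compatible, which is why the original mistake went through silently.

```python
import enum
import re


class RegExFlags(enum.Flag):
    """Hypothetical flag type that is deliberately not an int."""
    IGNORECASE = enum.auto()
    MULTILINE = enum.auto()


# enum.Flag supplies __or__, so flags can still be combined.
combined = RegExFlags.IGNORECASE | RegExFlags.MULTILINE

pattern = re.compile(r"([a-z]+)://")
try:
    # A non-integer flag cannot be mistaken for the `pos` argument.
    pattern.match("http://www.hello.com", RegExFlags.IGNORECASE)
    rejected = False
except TypeError:
    rejected = True

# The set-based alternative: plain membership tests, no bit arithmetic.
flag_set = {RegExFlags.IGNORECASE, RegExFlags.MULTILINE}

print(rejected, RegExFlags.IGNORECASE in combined, RegExFlags.IGNORECASE in flag_set)
```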
 
