regex confusion

John Hunter

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is
a complete example:

import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = rgxPrev.match(s)
print m
print s.find('a')

m is None (no match), yet s.find('a') reports an 'a' at index 48.

I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

Or am I insane?

John Hunter


hunter:~/python/projects/poker/data/pokerroom> uname -a
Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686
i686 i386 GNU/Linux
hunter:~/python/projects/poker/data/pokerroom> python
Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
[GCC 3.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to rlcompleter2 0.95
for nice experiences hit <tab> multiple times
 
Diez B. Roggisch

John said:
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior. The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
>
> rgxPrev = re.compile('.*?a.*?')

This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the expression
to the left. I'm not exactly sure why this is not working, but it's definitely
redundant. Eliminating the redundancy gives you this:

rgxPrev = re.compile('.*a.*')

Works perfect.

Regards,

Diez
 
A.M. Kuchling

rgxPrev = re.compile('.*?a.*?')

'.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
won't match unless 'a' is on the very first line. Add (?s) to your
expression, and it should work (though it'll be much slower than the .find()
method).
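
A minimal sketch of the behavior described here, written in Python 3 syntax. The string `text` is a made-up stand-in for the downloaded page (an assumption, not the real file contents); its first line contains no 'a'. The call to search() is an extra illustration and was not part of the original suggestion.

```python
import re

# Hypothetical stand-in for the page: the first line has no 'a'.
text = "<html>\n<head>\n<title>poker data</title>\n"

plain = re.compile(r".*?a.*?")
dotall = re.compile(r"(?s).*?a.*?")

no_match = plain.match(text)     # '.' cannot cross the first newline
with_flag = dotall.match(text)   # with (?s), '.' also matches newlines
searched = plain.search(text)    # search() may start on a later line

print(no_match)                  # None
print(repr(with_flag.group()))   # '<html>\n<hea'
print(repr(searched.group()))    # '<hea'
print(text.find("a"))            # 10
```

match() only tries the pattern at the start of the string, so without DOTALL the first line has to supply the 'a'.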

--amk
 
Peter Hansen

Diez B. Roggisch said:
> This is a bogus regex - a '*' means "zero or more occurrences" for the
> expression to the left. '?' means "zero or one occurrence" for the exp to
> the left.

Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the match
in non-greedy or minimal fashion; as few characters as possible will be
matched. ....
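
A small sketch of the difference the documentation describes, in Python 3 syntax. The sample string is an illustrative assumption and is not taken from the thread.

```python
import re

# Illustrative input, not from the original post.
sample = "<b>bold</b>"

# Greedy: '.*' grabs as much as it can, then backs off to the last '>'.
greedy = re.match(r"<.*>", sample).group()

# Non-greedy: '.*?' grabs as little as it can, stopping at the first '>'.
lazy = re.match(r"<.*?>", sample).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```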

-Peter
 
Peter Otten

John said:
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior. The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
>
> rgxPrev = re.compile('.*?a.*?')
>
> to match the string contents of the file. But it doesn't. Here is
>
> [...]
> I read the regex to mean non-greedy match of anything up to an a,
> followed by non-greedy match of anything following an a, which this
> file should match.

There is a nice example where non-greedy regexes are really useful in A. M.
Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html).

> Or am I insane?

This may be off-topic, but the easiest if not fastest way to find multiple
occurrences of a string in a text is:
.... print m.start()
....
0
3
5
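
The snippet above is truncated in the archive. The following is a sketch of one way to list every match position, not a reconstruction of the original code. It assumes re.finditer and a made-up sample string, chosen only so that the positions come out as 0, 3 and 5; it is written in Python 3 syntax.

```python
import re

# Hypothetical sample text; the string used in the original post is unknown.
text = "abcaba"

# finditer() yields one match object per non-overlapping occurrence.
starts = []
for m in re.finditer("a", text):
    starts.append(m.start())
    print(m.start())
```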
Peter
 
Diez B. Roggisch

> > This is a bogus regex - a '*' means "zero or more occurrences" for the
> Not true. See http://www.python.org/doc/current/lib/re-syntax.html :
>
> *?, +?, ??
> The "*", "+", and "?" qualifiers are all greedy; they match as much text
> as possible. .... Adding "?" after the qualifier makes it perform the
> match in non-greedy or minimal fashion; as few characters as possible will
> be matched. ....

Hmm. But if that's true, what does ".??" mean then - the first ? is not
greedy, so nothing is matched at all. The same is true for ".*?", and
".+?" is then equal to "." So what makes this useful? The regex in question
definitely didn't work with it.

Diez
 
Diez B. Roggisch

> Hmm. But if that's true, what does ".??" mean then - the first ? is not
> greedy, so nothing is matched at all. The same is true for ".*?", and
> ".+?" is then equal to "." So what makes this useful? The regex in
> question definitely didn't work with it.

Ok - I just found out - it makes sense when taking into account what follows
in the regex, as that will be matched earlier. Neat - didn't know that such
things existed.
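
A short sketch of that point in Python 3 syntax, using a made-up input string (an assumption for illustration). On their own the non-greedy forms match as little as possible, but once something follows them in the pattern they expand just far enough for the rest to match.

```python
import re

# Illustrative input, not from the thread.
text = "abcabc"

# Standing alone, the non-greedy quantifiers settle for the minimum.
alone_opt = re.match(r".??", text).group()    # zero characters
alone_star = re.match(r".*?", text).group()   # zero characters
alone_plus = re.match(r".+?", text).group()   # exactly one character

# Followed by 'c', the non-greedy form stops at the first 'c',
# while the greedy form runs to the last one.
lazy_then_c = re.match(r".*?c", text).group()
greedy_then_c = re.match(r".*c", text).group()

print(repr(alone_opt), repr(alone_star), repr(alone_plus))  # '' '' 'a'
print(lazy_then_c)    # abc
print(greedy_then_c)  # abcabc
```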

Diez
 
John Hunter

Peter> This may be off-topic, but the easiest if not fastest way
Peter> to find multiple occurrences of a string in a text is:

Right, I actually am using regex matching and not literal char
matching, but in trying to debug why my regex wasn't working, I
simplified it to the simplest case I could, which was a string
literal.

Thanks for the DOTALL pointer above.

JDH
 
