regex confusion

John Hunter · Dec 9, 2003

In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is
a complete example:

import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = rgxPrev.match(s)
print m
print s.find('a')

m is None (no match) and the s.find('a') reports an 'a' at index 48.

I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

Or am I insane?

John Hunter

hunter:~/python/projects/poker/data/pokerroom> uname -a
Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686
i686 i386 GNU/Linux
hunter:~/python/projects/poker/data/pokerroom> python
Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
[GCC 3.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to rlcompleter2 0.95
for nice experiences hit <tab> multiple times

Luther Barnum · Dec 9, 2003

Maybe you meant:
import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = re.match(rgxPrev, s)  # changed line
print m
print s.find('a')

match takes two arguments
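
For reference, a minimal sketch (Python 3 syntax, with a made-up sample string, since the original file is not available) of the two calling styles. The module-level re.match takes the pattern and the string, while the match method of a compiled pattern takes only the string, so the call in the original post was already valid.

```python
import re

# Hypothetical sample text; not the file from the original post.
sample = "xyz abc"

pattern = re.compile(".*?a.*?")

# Method on the compiled pattern: one required argument, the string.
m1 = pattern.match(sample)

# Module-level function: the pattern (string or compiled) plus the string.
m2 = re.match(pattern, sample)
m3 = re.match(".*?a.*?", sample)

# All three calls find the same match.
print(m1.group() == m2.group() == m3.group())
```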

Diez B. Roggisch · Dec 9, 2003

John said:
In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the exp to
the left. I'm not exactly sure why this is not working, but it's definitely
redundant. Eliminating the redundancy gives you this:

rgxPrev = re.compile('.*a.*')

Works perfect.

Regards,

Diez

A.M. Kuchling · Dec 9, 2003

rgxPrev = re.compile('.*?a.*?')

'.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
won't match unless 'a' is on the very first line. Add (?s) to your
expression, and it should work (though it'll be much slower than the .find()
method).

--amk
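
A small sketch of the newline behaviour described above, written for Python 3 and using a made-up two-line string rather than the original file. It suggests that match() fails when the first 'a' sits after a newline, and that either re.DOTALL or an inline (?s) lets '.' cross the line break. On this sample the greedy variant '.*a.*' fails in the same way without the flag.

```python
import re

# Hypothetical text: the first line has no 'a', the second one does.
text = "first line\nsecond line has an a\n"

# Without DOTALL, '.' stops at the newline, and match() is anchored at
# the start of the string, so the 'a' on line two is out of reach.
plain = re.match(".*?a.*?", text)
greedy = re.match(".*a.*", text)

# With DOTALL (flag argument or inline (?s)), '.' also matches '\n'.
dotall_flag = re.match(".*?a.*?", text, re.DOTALL)
dotall_inline = re.match("(?s).*?a.*?", text)

# search() scans forward instead of anchoring at position 0.
searched = re.search("a", text)

print(plain, greedy)
print(dotall_flag.end(), dotall_inline.end(), searched.start())
```

If the goal is only to find the pattern somewhere in the text, search() avoids the leading '.*?' altogether.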

Peter Hansen · Dec 9, 2003

Diez B. Roggisch said:
This is a bogus regex - a '*' means "zero or more occurrences" for the
expression to the left. '?' means "zero or one occurrence" for the exp to
the left.

Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the match
in non-greedy or minimal fashion; as few characters as possible will be
matched. ....

-Peter
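
To illustrate the quoted documentation, here is a minimal Python 3 sketch comparing a greedy and a non-greedy quantifier. The sample string is invented for the illustration.

```python
import re

# Hypothetical sample string, chosen only for illustration.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' grabs as much as possible, then backs off to the last '>'.
greedy = re.match("<.*>", html).group()

# Non-greedy: '.*?' grabs as little as possible, stopping at the first '>'.
lazy = re.match("<.*?>", html).group()

print(greedy)
print(lazy)
```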

Peter Otten · Dec 9, 2003

John said:
In trying to debug why a certain regex wasn't working like I expected
it to, I came across this strange (to me) behavior. The file I am
trying to match definitely contains many instances of the letter 'a',
so I would expect the regex

rgxPrev = re.compile('.*?a.*?')

to match the string contents of the file. But it doesn't. Here is

[...]

I read the regex to mean non-greedy match of anything up to an a,
followed by non-greedy match of anything following an a, which this
file should match.

There is a nice example where non-greedy regexes are really useful in A. M.
Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html)

Or am I insane?

This may be off-topic, but the easiest if not fastest way to find multiple
occurrences of a string in a text is:

[...]
...     print m.start()
...
0
3
5

Peter
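
The setup lines of the session above did not survive in this archive. One way to produce output like 0, 3, 5 is re.finditer, sketched below in Python 3. The sample string is an assumption picked to give those indices, not necessarily the one from the original post.

```python
import re

# Hypothetical text, picked so that 'a' occurs at indices 0, 3 and 5.
text = "abcada"

# finditer yields one match object per non-overlapping occurrence.
starts = [m.start() for m in re.finditer("a", text)]

for start in starts:
    print(start)
```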

Diez B. Roggisch · Dec 9, 2003

This is a bogus regex - a '*' means "zero or more occurrences" for the

Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. .... Adding "?" after the qualifier makes it perform the
match in non-greedy or minimal fashion; as few characters as possible will
be matched. ....

Hmm. But when that's true, what does ".??" then mean - the first ? is not
greedy, so it is nothing matched at all. The same is true for ".*?", and
".+?" is then equal to "." So what makes this useful? The regex in question
definitely didn't work with it.

Diez

Diez B. Roggisch · Dec 9, 2003

Hmm. But when that's true, what does ".??" then mean - the first ? is not
greedy, so it is nothing matched at all. The same is true for ".*?", and
".+?" is then equal to "." So what makes this useful? The regex in
question definitely didn't work with it.

Ok - I just found out - it makes sense when taking into account what follows
in the regex, as that will be matched earlier. Neat - didn't know that such
things existed.

Diez
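
The point above can be checked with a short Python 3 sketch (the sample string is made up). On their own, the non-greedy quantifiers match the minimum they are allowed to; once something follows them in the pattern, they expand just far enough for the rest to match.

```python
import re

s = "xyz"  # hypothetical sample

# Alone, each lazy quantifier settles for its minimum.
opt_alone = re.match(".??", s).group()   # zero characters
star_alone = re.match(".*?", s).group()  # zero characters
plus_alone = re.match(".+?", s).group()  # exactly one character

# Followed by a literal, '.*?' grows until that literal can match.
up_to_y = re.match(".*?y", s).group()
up_to_z = re.match(".*?z", s).group()

print(repr(opt_alone), repr(star_alone), repr(plus_alone))
print(repr(up_to_y), repr(up_to_z))
```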

John Hunter · Dec 9, 2003

Peter> This may be off-topic, but the easiest if not fastest way
Peter> to find multiple occurrences of a string in a text is:

Right, I actually am using regex matching and not literal char
matching, but in trying to debug why my regex wasn't working, I
simplified it to the simplest case I could, which was a string
literal.

Thanks for the DOTALL pointer above.

JDH
