Understanding '?' in regular expressions


krishna.k.kishor3

Can someone explain the below behavior please?
re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
re.findall(re_obj,'1000,1020,1000')
['1000']
re.findall(re_obj,'1000,1020, 1000')
['1020', '1000']

However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
re.findall(re_obj,'1000,1020,1000')
['1000', '1020', '1000']

I am not able to understand what's causing the difference in behavior here. I am assuming it's not the 'greediness' of "?".

Thank you,
Kishor
 

Jussi Piitulainen

Can someone explain the below behavior please?
re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
re.findall(re_obj,'1000,1020,1000')
['1000']
re.findall(re_obj,'1000,1020, 1000')
['1020', '1000']

However when I use "[\,]??" instead of "[\,]?" as below, I see a
different result
re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
re.findall(re_obj,'1000,1020,1000')
['1000', '1020', '1000']

Those re_obj should be re1 and re2, respectively. With that
correction, the behaviour appears to be as you say.

I am not able to understand what's causing the difference in
behavior here. I am assuming it's not the 'greediness' of "?".

But the greed seems to be the only difference.

I can't wrap my mind around this (at the moment at least) and I need
to rush away, but may I suggest the removal of all that is not
relevant to the problem at hand. Study these instead:
re.findall(r'(10.0,?){1,3}', '1000,1020,1000')
['1000']
re.findall(r'(10.0,??){1,3}', '1000,1020,1000')
['1000', '1020', '1000']
 

Ian Kelly

Can someone explain the below behavior please?
re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
re.findall(re_obj,'1000,1020,1000')
['1000']
re.findall(re_obj,'1000,1020, 1000')
['1020', '1000']

Try removing the grouping parentheses to see the full strings being matched:
re1 = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
re.findall(re1,'1000,1020,1000')
['1000,1020,1000']
re.findall(re1,'1000,1020, 1000')
['1000,1020,', '1000']

In the first case, the regular expression is matching the full string.
It could also match shorter expressions, but as only the space
quantifiers are non-greedy and there are no spaces to match anyway, it
does not. With the grouping parentheses in place, only the *last*
value of the group is returned, which is why you only see the last
'1000' instead of all three strings in the group, even though the
group is actually matching three different substrings.
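
One way to check this is with re.finditer, which exposes the whole match alongside the capturing group. This is only a small sketch; the variable names are illustrative and not from the original post:

```python
import re

# Original pattern, with the capturing group around the alternation.
pattern = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')

# group(0) is the whole match; group(1) is whatever the capturing
# group matched on its final repetition.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer('1000,1020,1000')]
print(matches)  # [('1000,1020,1000', '1000')]
```

The single match spans the entire string, but the group holds only the value from its third and last repetition.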

In the second case, the regular expression finds first the '1000,1020,'
and then the '1000' as two separate matches. The reason for this is
the space. Since the quantifier on the space is non-greedy, it first
tries *not* matching the space, finds that it has a valid match, and
so does not backtrack. The '1000' is then identified as a separate
match. As before, with the grouping parentheses in place you see only
the '1020' and the last '1000' because the group only reports the last
substring it matched for that particular match.
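
The spans of the two matches make the split visible. The following is a sketch along the same lines as before (names are illustrative):

```python
import re

pattern = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
text = '1000,1020, 1000'

# For each match: where it starts and ends, the full matched text,
# and the last value captured by the group.
found = [(m.span(), m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(found)
# [((0, 10), '1000,1020,', '1020'), ((11, 15), '1000', '1000')]
```

The first match ends at index 10, just before the space, and the search for the second match resumes from there.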

However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
re.findall(re_obj,'1000,1020,1000')
['1000', '1020', '1000']

I am not able to understand what's causing the difference in behavior here. I am assuming it's not the 'greediness' of "?".

The difference is the non-greediness of the comma quantifier. When it
comes time for it to match the comma, because the quantifier is
non-greedy, it first tries *not* matching the comma, whereas before it
first tried to match it. As with the space above, when the comma is
not matched, it finds that it has a valid match anyway if it just
stops matching immediately. So it does not need to backtrack, and in
this case it ends up terminating each match early upon the comma and
returning all three numbers as separate matches.
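
To compare the two side by side, here is a sketch that drops the capturing group (so the full matches are returned) and changes nothing but the quantifier on the comma. The names greedy_comma and lazy_comma are only for illustration:

```python
import re

text = '1000,1020,1000'

# Same pattern twice; only the quantifier on the comma differs.
greedy_comma = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
lazy_comma = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]??[ ]*?){1,3}')

greedy_matches = [m.group(0) for m in greedy_comma.finditer(text)]
lazy_matches = [m.group(0) for m in lazy_comma.finditer(text)]

print(greedy_matches)  # ['1000,1020,1000']
print(lazy_matches)    # ['1000', '1020', '1000']
```

With the greedy comma the pattern consumes each comma and carries on to the next number; with the lazy comma each match stops in front of the comma, so the next number starts a new match.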

What exactly is it that you're trying to do with this regular
expression? I suspect that the solution is actually a lot simpler
than you're making it.
 
