Understanding '?' in regular expressions

Discussion in 'Python' started by krishna.k.kishor3@gmail.com, Nov 16, 2012.

  1. Guest

    Can someone explain the below behavior please?

    >>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
    >>> re.findall(re_obj,'1000,1020,1000')

    ['1000']
    >>> re.findall(re_obj,'1000,1020, 1000')

    ['1020', '1000']

    However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
    >>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
    >>> re.findall(re_obj,'1000,1020,1000')

    ['1000', '1020', '1000']

    I am not able to understand what's causing the difference in behavior here. I am assuming it's not the 'greediness' of "?".

    Thank you,
    Kishor
    , Nov 16, 2012
    #1

  2. Jussi Piitulainen writes:

    > Can someone explain the below behavior please?
    >
    > >>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
    > >>> re.findall(re_obj,'1000,1020,1000')

    > ['1000']
    > >>> re.findall(re_obj,'1000,1020, 1000')

    > ['1020', '1000']
    >
    > However when I use "[\,]??" instead of "[\,]?" as below, I see a
    > different result
    > >>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
    > >>> re.findall(re_obj,'1000,1020,1000')

    > ['1000', '1020', '1000']


    Those re_obj should be re1 and re2, respectively. With that
    correction, the behaviour appears to be as you say.

    > I am not able to understand what's causing the difference of
    > behavior here, I am assuming it's not 'greediness' of "?"


    But the greediness seems to be the only difference.

    I can't wrap my mind around this (at the moment at least) and I need
    to rush away, but may I suggest the removal of all that is not
    relevant to the problem at hand. Study these instead:

    >>> re.findall(r'(10.0,?){1,3}', '1000,1020,1000')

    ['1000']
    >>> re.findall(r'(10.0,??){1,3}', '1000,1020,1000')

    ['1000', '1020', '1000']
    Jussi Piitulainen, Nov 16, 2012
    #2

  3. Ian Kelly Guest

    On Fri, Nov 16, 2012 at 12:28 AM, <> wrote:
    > Can someone explain the below behavior please?
    >
    > >>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
    > >>> re.findall(re_obj,'1000,1020,1000')

    > ['1000']
    > >>> re.findall(re_obj,'1000,1020, 1000')

    > ['1020', '1000']


    Try removing the grouping parentheses to see the full strings being matched:

    >>> re1 = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
    >>> re.findall(re1,'1000,1020,1000')

    ['1000,1020,1000']
    >>> re.findall(re1,'1000,1020, 1000')

    ['1000,1020,', '1000']

    In the first case, the regular expression is matching the full string.
    It could also match shorter expressions, but as only the space
    quantifiers are non-greedy and there are no spaces to match anyway, it
    does not. With the grouping parentheses in place, only the *last*
    value of the group is returned, which is why you only see the last
    '1000' instead of all three strings in the group, even though the
    group is actually matching three different substrings.

    In the second case, the regular expression finds first the '1000,1020,'
    and then the '1000' as two separate matches. The reason for this is
    the space. Since the quantifier on the space is non-greedy, it first
    tries *not* matching the space, finds that it has a valid match, and
    so does not backtrack. The '1000' is then identified as a separate
    match. As before, with the grouping parentheses in place you see only
    the '1020' and the last '1000' because the group only reports the last
    substring it matched for that particular match.
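
    A minimal sketch of the point about repeated groups: re.finditer
    exposes both the whole match and the group's final capture, so the
    two cases above can be compared side by side. The helper name
    whole_and_last is made up for this illustration.

```python
import re

# Same pattern as re1 in the question, with the capturing group kept.
re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')

def whole_and_last(pattern, text):
    """Return (whole match, last capture of group 1) for each match."""
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

no_space = whole_and_last(re1, '1000,1020,1000')
with_space = whole_and_last(re1, '1000,1020, 1000')

print(no_space)    # [('1000,1020,1000', '1000')]
print(with_space)  # [('1000,1020,', '1020'), ('1000', '1000')]
```

    re.findall returns only the second element of each pair when the
    pattern contains a single capturing group, which is why the earlier
    captures inside a match seem to disappear.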


    > However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
    > >>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
    > >>> re.findall(re_obj,'1000,1020,1000')

    > ['1000', '1020', '1000']
    >
    > I am not able to understand what's causing the difference of behavior here, I am assuming it's not 'greediness' of "?"


    The difference is the non-greediness of the comma quantifier. When it
    comes time for it to match the comma, because the quantifier is
    non-greedy, it first tries *not* matching the comma, whereas before it
    first tried to match it. As with the space above, when the comma is
    not matched, it finds that it has a valid match anyway if it just
    stops matching immediately. So it does not need to backtrack, and in
    this case it ends up terminating each match early upon the comma and
    returning all three numbers as separate matches.
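
    One way to see where each match stops, sketched here with the two
    patterns from the question, is to print the match spans. The
    variable names are only for illustration.

```python
import re

text = '1000,1020,1000'

# re1 uses a greedy optional comma, re2 a lazy one; nothing else differs.
re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')

greedy_spans = [m.span() for m in re1.finditer(text)]
lazy_spans = [m.span() for m in re2.finditer(text)]

print(greedy_spans)  # [(0, 14)]
print(lazy_spans)    # [(0, 4), (5, 9), (10, 14)]
```

    With the lazy comma each match ends just before a comma, so the
    next search has to skip that comma and start a fresh match.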

    What exactly is it that you're trying to do with this regular
    expression? I suspect that the solution is actually a lot simpler
    than you're making it.
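
    As a rough sketch only: if the goal is simply to extract every
    occurrence of the three codes (an assumption, since the thread does
    not state the actual requirement), a pattern without the outer
    repetition may be enough. The name codes is hypothetical.

```python
import re

# Hypothetical simplification: match each allowed code on its own,
# using word boundaries so that e.g. '10200' is not picked up.
codes = re.compile(r'\b(?:1000|1010|1020)\b')

found = codes.findall('1000,1020, 1000')
print(found)  # ['1000', '1020', '1000']
```

    Because there is no capturing group and no repetition, re.findall
    returns each code directly, with or without spaces around the commas.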
    Ian Kelly, Nov 16, 2012
    #3
