regex confusion

Discussion in 'Python' started by John Hunter, Dec 9, 2003.

  1. John Hunter

    John Hunter Guest

    In trying to debug why a certain regex wasn't working like I expected
    it to, I came across this strange (to me) behavior. The file I am
    trying to match definitely contains many instances of the letter 'a',
    so I would expect the regex

    rgxPrev = re.compile('.*?a.*?')

    to match the string contents of the file. But it doesn't. Here is
    a complete example:

    import re, urllib
    rgxPrev = re.compile('.*?a.*?')

    url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
    s = urllib.urlopen(url).read()
    m = rgxPrev.match(s)
    print m
    print s.find('a')

    m is None (no match), yet s.find('a') reports an 'a' at index 48.

    I read the regex to mean non-greedy match of anything up to an a,
    followed by non-greedy match of anything following an a, which this
    file should match.

    Or am I insane?

    John Hunter


    hunter:~/python/projects/poker/data/pokerroom> uname -a
    Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686
    i686 i386 GNU/Linux
    hunter:~/python/projects/poker/data/pokerroom> python
    Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
    [GCC 3.3.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Welcome to rlcompleter2 0.95
    for nice experiences hit <tab> multiple times
    John Hunter, Dec 9, 2003
    #1

  2. Maybe you meant:
    import re, urllib
    rgxPrev = re.compile('.*?a.*?')

    url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
    s = urllib.urlopen(url).read()
    m = re.match(rgxPrev, s)  # module-level form of rgxPrev.match(s)
    print m
    print s.find('a')

    The module-level function re.match takes two arguments.
    Luther Barnum, Dec 9, 2003
    #2

  3. John Hunter wrote:

    >
    > In trying to debug why a certain regex wasn't working like I expected
    > it to, I came across this strange (to me) behavior. The file I am
    > trying to match definitely contains many instances of the letter 'a',
    > so I would expect the regex
    >
    > rgxPrev = re.compile('.*?a.*?')


    This is a bogus regex - a '*' means "zero or more occurrences" for the
    expression to the left. '?' means "zero or one occurrence" for the exp to
    the left. I'm not exactly sure why this is not working, but it's definitely
    redundant. Eliminating the redundancy gives you this:

    rgxPrev = re.compile('.*a.*')

    Works perfectly.

    Regards,

    Diez
    Diez B. Roggisch, Dec 9, 2003
    #3
  4. On Tue, 09 Dec 2003 09:43:24 -0600,
    John Hunter <> wrote:
    > rgxPrev = re.compile('.*?a.*?')


    '.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
    won't match unless 'a' is on the very first line. Add (?s) to your
    expression, and it should work (though it'll be much slower than the .find()
    method).
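
    A minimal sketch of the point above. The two-line string is a made-up
    stand-in for the downloaded page (it is not the real file), and the code
    uses Python 3 print syntax rather than the Python 2 used elsewhere in this
    thread:

```python
import re

# Made-up stand-in for the downloaded page: the first line contains no 'a'.
s = "first line\nsecond line has an a\n"

plain = re.compile(r'.*?a.*?')
dotall = re.compile(r'(?s).*?a.*?')  # equivalent to passing re.DOTALL

m_plain = plain.match(s)      # '.' cannot cross the newline, so this fails
m_dotall = dotall.match(s)    # with (?s), '.' matches '\n' as well
m_search = plain.search(s)    # search() may start the match on a later line

print(m_plain)
print(repr(m_dotall.group()))
print(repr(m_search.group()))
print(s.find('a'))
```

    Note that match() only ever tries the start of the string, which is why
    the position of the first newline matters here; search() scans forward
    and so finds the 'a' on the second line even without (?s).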

    --amk
    A.M. Kuchling, Dec 9, 2003
    #4
  5. Peter Hansen

    Peter Hansen Guest

    "Diez B. Roggisch" wrote:
    >
    > John Hunter wrote:
    >
    > >
    > > In trying to debug why a certain regex wasn't working like I expected
    > > it to, I came across this strange (to me) behavior. The file I am
    > > trying to match definitely contains many instances of the letter 'a',
    > > so I would expect the regex
    > >
    > > rgxPrev = re.compile('.*?a.*?')

    >
    > This is a bogus regex - a '*' means "zero or more occurrences" for the
    > expression to the left. '?' means "zero or one occurrence" for the exp to
    > the left.


    Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

    *?, +?, ??
    The "*", "+", and "?" qualifiers are all greedy; they match as much text
    as possible. .... Adding "?" after the qualifier makes it perform the match
    in non-greedy or minimal fashion; as few characters as possible will be
    matched. ....
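
    To illustrate the quoted passage, here is a small sketch comparing the
    greedy and non-greedy forms on a made-up HTML-like string (the sample
    text is an assumption for illustration; Python 3 syntax):

```python
import re

# Made-up sample text with two pairs of tags.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r'<.*>', text).group()

# Non-greedy: '.*?' grabs as little as possible, so it stops at the first '>'.
lazy = re.match(r'<.*?>', text).group()

print(greedy)
print(lazy)
```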

    -Peter
    Peter Hansen, Dec 9, 2003
    #5
  6. Peter Otten

    Peter Otten Guest

    John Hunter wrote:

    >
    > In trying to debug why a certain regex wasn't working like I expected
    > it to, I came across this strange (to me) behavior. The file I am
    > trying to match definitely contains many instances of the letter 'a',
    > so I would expect the regex
    >
    > rgxPrev = re.compile('.*?a.*?')
    >
    > to match the string contents of the file. But it doesn't. Here is


    [...]

    > I read the regex to mean non-greedy match of anything up to an a,
    > followed by non-greedy match of anything following an a, which this
    > file should match.


    There is a nice example where non-greedy regexes are really useful in A. M.
    Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html)

    > Or am I insane?


    This may be off-topic, but the easiest if not fastest way to find multiple
    occurrences of a string in a text is:

    >>> import re
    >>> r = re.compile("a")
    >>> for m in r.finditer("abca\na"):
    ...     print m.start()
    ...
    0
    3
    5
    >>>


    Peter
    Peter Otten, Dec 9, 2003
    #6
  7. >> This is a bogus regex - a '*' means "zero or more occurrences" for the
    >> expression to the left. '?' means "zero or one occurrence" for the exp to
    >> the left.

    >
    > Not true. See http://www.python.org/doc/current/lib/re-syntax.html :
    >
    > *?, +?, ??
    > The "*", "+", and "?" qualifiers are all greedy; they match as much text
    > as possible. .... Adding "?" after the qualifier makes it perform the
    > match in non-greedy or minimal fashion; as few characters as possible will
    > be matched. ....


    Hmm. But if that's true, what does ".??" mean then? The first ? is
    non-greedy, so nothing is matched at all. The same is true for ".*?", and
    ".+?" is then equal to ".". So what makes this useful? The regex in question
    definitely didn't work with it.

    Diez
    Diez B. Roggisch, Dec 9, 2003
    #7

  8. > Hmm. But if that's true, what does ".??" mean then? The first ? is
    > non-greedy, so nothing is matched at all. The same is true for ".*?", and
    > ".+?" is then equal to ".". So what makes this useful? The regex in
    > question definitely didn't work with it.


    Ok, I just found out: it makes sense when you take into account what follows
    in the regex, since the non-greedy part stops as soon as the rest of the
    pattern can match. Neat, I didn't know that such things existed.
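
    A short sketch of that observation, using a made-up test string (Python 3
    syntax). On their own, the non-greedy forms match the minimum they are
    allowed; what comes after them in the pattern decides how far they grow:

```python
import re

# Made-up test string containing two 'a' characters.
s = "xxaxxa"

# On their own, the non-greedy forms settle for the minimum they are allowed.
lone_opt = re.match(r'.??', s).group()    # zero characters
lone_star = re.match(r'.*?', s).group()   # zero characters
lone_plus = re.match(r'.+?', s).group()   # exactly one character

# With something after them, they grow only until the rest can match.
lazy = re.match(r'.*?a', s).group()       # stops at the first 'a'
greedy = re.match(r'.*a', s).group()      # runs on to the last 'a'

print(repr(lone_opt), repr(lone_star), repr(lone_plus))
print(lazy, greedy)
```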

    Diez
    Diez B. Roggisch, Dec 9, 2003
    #8
  9. John Hunter

    John Hunter Guest

    >>>>> "Peter" == Peter Otten <> writes:

    Peter> This may be off-topic, but the easiest if not fastest way
    Peter> to find multiple occurrences of a string in a text is:

    Right, I actually am using regex matching and not literal char
    matching, but in trying to debug why my regex wasn't working, I
    simplified it to the simplest case I could, which was a string
    literal.

    Thanks for the DOTALL pointer above.

    JDH
    John Hunter, Dec 9, 2003
    #9
