unexpected behaviour for python regexp: caret symbol almost useless?

Discussion in 'Python' started by conan, May 28, 2006.

  1. conan

    conan Guest

    This regexp
    '<widget class=".*" id=".*">'

    works well with 'grep' for matching lines of the kind
    <widget class="GtkWindow" id="window1">

    in an XML .glade file

    However, that's not true for the re module in Python, which treats the
    regexp as if it were specified this way:
    '^<widget class=".*" id=".*">'

    For some reason, regexps in Python always match from the start of the
    line, whether or not you use the caret symbol '^'.

    I had a hard time figuring out why this regexp wasn't working:
    regexp = re.compile(r'<widget class=".*" id="(.*)">')

    The solution was to account for the surrounding whitespace:
    regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

    To reproduce behaviour just take a .glade file and this python script:
    <code>
    import re

    glade_file_name = 'some.glade'

    bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
    good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

    for line in open(glade_file_name):
        if bad_regexp.match(line):
            print 'bad:', line.strip()
        if good_regexp.match(line):
            print 'good:', line.strip()
    </code>

    The thing is, I expected to have to put the caret in explicitly to tell
    the regexp to match at the start of the line, something like:
    r'^<widget class=".*" id="(.*)">'
    However, Python's regexp is taking care of that for me. This is not the
    behaviour I would expect from what I know about regexps, but maybe I'm
    missing something.
    conan, May 28, 2006
    #1

  2. conan

    Peter Otten Guest

    conan wrote:

    > The thing is, I expected to have to put the caret in explicitly to tell
    > the regexp to match at the start of the line, something like:
    > r'^<widget class=".*" id="(.*)">'
    > However, Python's regexp is taking care of that for me. This is not the
    > behaviour I would expect from what I know about regexps, but maybe I'm
    > missing something.


    You want search(), not match().

    http://docs.python.org/lib/matching-searching.html
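
As a minimal sketch of the difference (the sample line is made up, modelled on the .glade line quoted above, and the snippet uses Python 3 syntax): match() only tries the pattern at the very start of the string, while search() scans forward for the first position where it fits, which is closer to what grep does.

```python
import re

pattern = re.compile(r'<widget class=".*" id="(.*)">')

# Hypothetical sample line, indented the way lines in a .glade file often are.
line = '    <widget class="GtkWindow" id="window1">\n'

# match() only tries position 0, so the leading spaces make it fail.
print(pattern.match(line))

# search() scans the whole string for the first place the pattern fits.
found = pattern.search(line)
print(found.group(1))

# The \s* workaround succeeds only because it lets match() consume the indentation.
workaround = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')
print(workaround.match(line).group(1))
```

With search() the original pattern can stay as it was, with no need for the extra \s* at either end.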

    Peter
    Peter Otten, May 28, 2006
    #2

  3. conan

    Paul McGuire Guest

    "conan" <> wrote in message
    news:...
    > This regexp
    > '<widget class=".*" id=".*">'
    >
    > works well with 'grep' for matching lines of the kind
    > <widget class="GtkWindow" id="window1">
    >
    > in an XML .glade file
    >


    As Peter Otten has already mentioned, this is the difference between the re
    "match" and "search" methods.

    As purely a lateral exercise, here is a pyparsing rendition of your program:

    ------------------------------------
    from pyparsing import makeXMLTags, line

    # define pyparsing patterns for begin and end XML tags
    widgetStart,widgetEnd = makeXMLTags("widget")

    # read the file contents
    glade_file_name = 'some.glade'
    gladeContents = open(glade_file_name).read()

    # scan the input string for matching tags
    for widget,start,end in widgetStart.scanString(gladeContents):
        print "good:", line(start, gladeContents).strip()
        print widget["class"], widget["id"]
        print "Class: %(class)s; Id: %(id)s" % widget
    ------------------------------------
    Not quite an exact match, only the good lines get listed. But also check
    out some of the other capabilities. To do this with re's, you have to
    clutter up the re expression with field names, as in:

    (r'<widget class="(?P<class>.*)" id="(?P<id>.*)">')
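
As a rough sketch of how such named groups can be read back with the re module alone (Python 3 syntax; the sample line is invented, and `[^"]*` is swapped in for `.*` here as an assumption, so that each group stops at its closing quote):

```python
import re

# Named groups: re accepts "class" as a group name even though it is a Python keyword.
widget_re = re.compile(r'<widget class="(?P<class>[^"]*)" id="(?P<id>[^"]*)">')

# Made-up sample line for illustration only.
line = '  <widget class="GtkButton" id="button1">'

m = widget_re.search(line)
if m:
    # groupdict() returns the named groups as an ordinary dict.
    fields = m.groupdict()
    print("Class: %(class)s; Id: %(id)s" % fields)
```

The dict from groupdict() can be used with the same %-formatting as the pyparsing result above, though it only offers key access, not attribute access.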

    The parsing patterns generated by makeXMLTags give dict-like and
    attribute-like access to any attributes included with the tag. If not for
    the unfortunate attribute name "class" (which is a Python keyword), you
    could also reference these values as widget.class and widget.id.

    If you are parsing HTML, there is also a makeHTMLTags method, which creates
    patterns that are less rigid about upper/lower case and other XML
    strictnesses.

    -- Paul
    Paul McGuire, May 28, 2006
    #3
  4. conan

    conan Guest

    Thank you, I had read this, but somehow I missed it when the issue
    arose.
    conan, May 29, 2006
    #4
  5. conan

    conan Guest

    Thank you Paul.

    Since the only thing I'm doing is extracting these fields, and I have no
    plans to include other stuff, a regexp is fine. However, I will keep
    'pyparsing' in mind when I need to do more complex parsing.

    As you can see in the example I sent, I was trying to get info from a
    glade file. In particular, I was tired of doing this every time I need to
    access a widget:

    some_var = xml.get_widget('some_id')

    (doing this is tiresome when you have more than 10 widgets)

    So I wrote a little module that makes all widgets available as attributes
    of the object. For anyone interested, it is at:

    http://www.lugmen.org.ar/~p10n/sources/conan/utilidades/GetWidgets.py

    However, it is still pretty immature, since it lacks some checks.
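
The general idea can be sketched roughly as follows. This is not the contents of GetWidgets.py: the class name `WidgetNamespace`, the `get_widget` callable parameter, and the stand-in lookup are all hypothetical, chosen so the example runs without GTK (Python 3 syntax).

```python
import re

# Captures the id of every <widget ...> tag; [^"]* stops at the closing quote.
WIDGET_ID_RE = re.compile(r'<widget class="[^"]*" id="([^"]*)">')


class WidgetNamespace:
    """Expose each widget id found in the glade text as an attribute."""

    def __init__(self, glade_text, get_widget):
        # get_widget is any callable mapping an id to a widget object,
        # for example the get_widget method of a loaded glade XML object.
        for widget_id in WIDGET_ID_RE.findall(glade_text):
            setattr(self, widget_id, get_widget(widget_id))


# Stand-in data and lookup function, invented for this sketch.
sample = '''
<widget class="GtkWindow" id="window1">
  <widget class="GtkButton" id="ok_button">
'''


def fake_lookup(widget_id):
    return "<widget %s>" % widget_id


widgets = WidgetNamespace(sample, fake_lookup)
print(widgets.ok_button)
```

A real version would need the kind of checks mentioned above, for instance ids that are not valid Python attribute names.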
    conan, May 29, 2006
    #5
