using re.finditer()

Discussion in 'Python' started by Erik Johnson, Oct 27, 2004.

  1. Erik Johnson

    Erik Johnson Guest

    I am still fairly new to Python and trying to learn to put RE's to good
    use. I am a little confused about the finditer() method. It is documented
    like so:

    finditer( pattern, string)

    Return an iterator over all non-overlapping matches for the RE pattern in
    string. For each match, the iterator returns a match object. Empty matches
    are included in the result unless they touch the beginning of another match.
    New in version 2.2.


    I would say this documentation is not quite right (or incomplete at
    best) because it doesn't document any restriction about multiline matching,
    but it certainly seems to have one. This would seem to be an ideal
    application for finditer()...

    #! /usr/bin/python

    import re

    html = """
    <table>
    <tr>
    <td>Data 1-1</td>
    <td>Data 1-2</td>
    <td>Data 1-3</td>
    </tr>
    <tr>
    <td>Data 2-1</td>
    <td>Data 2-2
    </td>
    <td>Data 2-3</td>
    </tr>
    </table>
    """

    pat = r'<td.*?>(.*?)</td>'
    for match in re.finditer(pat, html):
        print match.group(1)


    The iterator returned seems to work fine to step through items that
    happen to be contained within one line. That is, you can step through flat,
    one-line td's, but if you want to step through tr's, this doesn't work (run
    this code and notice Data 2-2 is not there). finditer() doesn't accept a
    flag like re.DOTALL, as re.match() and re.search() do. It seems a shame not
    to be able to put an otherwise smart design to use.

    One workaround is to apply a regular RE like '<tr.*?>(.*?)</tr>', find
    the first match, and then chop off the found part as you go. Another is
    to change the pattern to something like:
    pat = r'<td.*?>([\n\w\s\d-]*?)</td>'
    I guess I will get one of those implemented and get this task done, but I
    am still interested in learning to use RE's better.
    Interestingly, using pat = r'<td.*?>([\r\n.]*?)</td>' does NOT work, and I
    don't understand why - can someone explain that?
    What exactly would be the equivalent set for dot with re.DOTALL turned on
    ([\w\W\r\n] works here, but does that really cover it)? Other ideas?
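
    On the question of an equivalent set, here is a minimal sketch, assuming
    Python 3 (which the thread predates). It compares the set [\w\W] against a
    dot compiled with re.DOTALL over a small sample of characters. The names
    SAMPLE, any_char_set, dotall_dot, and plain_dot are illustrative and do
    not come from the thread.

```python
import re

# Illustrative sample: line breaks, control characters, punctuation,
# letters, a digit, and one non-ASCII letter.
SAMPLE = ["\n", "\r", "\t", "\0", " ", "-", ".", "a", "Z", "7", "_", "\u00e9"]

any_char_set = re.compile(r"[\w\W]")      # a word character OR a non-word character
dotall_dot = re.compile(r".", re.DOTALL)  # dot with DOTALL turned on
plain_dot = re.compile(r".")              # dot with default flags

# True when both "any character" spellings accept every sampled character.
agree = all(
    any_char_set.fullmatch(ch) and dotall_dot.fullmatch(ch)
    for ch in SAMPLE
)

# Characters from the sample that the default dot refuses to match.
rejected_by_plain_dot = [ch for ch in SAMPLE if not plain_dot.fullmatch(ch)]

print(agree)
print(rejected_by_plain_dot)
```

    Because \w and \W are complements, the two together cover every
    character, so the extra \r\n in [\w\W\r\n] is redundant. [\s\S] is another
    common spelling of the same idea.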

    I can think of split & join substitution tricks & the like to replace \n
    with something else, then put it back in, but that gets kinda messy and
    requires you to find some special substitution character that's not
    elsewhere. I want to apply this to dynamically generated text that I don't
    control and keep this as general as possible. I'm wondering if I'm not
    missing something here - is there no way to make finditer() work
    (straightforwardly) on multilines using just a simple dot RE?

    Thanks for taking the time to read my post! :)

    -ej
    Erik Johnson, Oct 27, 2004
    #1

  2. Peter Otten

    Peter Otten Guest

    Erik Johnson wrote:

    > pat = r'<td.*?>(.*?)</td>'
    > for match in re.finditer(pat, html):
    >     print match.group(1)
    >
    >
    > The iterator returned seems to work fine to step through items that
    > happen to be contained within one line. That is, you can step through
    > flat, one-line td's, but if you want to step through tr's, this doesn't
    > work (run this code and notice Data 2-2 is not there). finditer() doesn't
    > accept a flag like re.DOTALL, as re.match() and re.search() do. It seems a
    > shame not to be able to put an otherwise smart design to use.


    There was a discussion on python-dev recently concerning "missing arguments"
    in re.findall() and re.finditer(), see

    http://mail.python.org/pipermail/python-dev/2004-September/048662.html

    I think no change was made as there is already an alternative spelling:

    r = re.compile(r'<td.*?>(.*?)</td>', re.DOTALL)
    for match in r.finditer(html):
        print match.group(1)

    (or two, I didn't know about the option to embed flags in the string until
    Robert Brewer's post).
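
    The spellings mentioned in this thread can be compared side by side. This
    is a minimal sketch, assuming Python 3 and a trimmed-down version of the
    original table. The third variant passes the flag straight to
    re.finditer(), which assumes a Python version whose module-level function
    accepts a flags argument (the version discussed in this thread did not).
    The variable names are illustrative.

```python
import re

# Trimmed version of the table from the first post: one single-line cell
# and one cell whose content spans a line break.
html = """
<table>
<tr>
<td>Data 1-1</td>
<td>Data 2-2
</td>
</tr>
</table>
"""

pat = r"<td.*?>(.*?)</td>"

# Default behaviour: the dot stops at a newline, so the multi-line cell is skipped.
without_flag = [m.group(1) for m in re.finditer(pat, html)]

# 1. Compile with the flag, then call finditer() on the pattern object.
compiled = re.compile(pat, re.DOTALL)
via_compile = [m.group(1) for m in compiled.finditer(html)]

# 2. Embed the flag at the start of the pattern with (?s).
via_inline = [m.group(1) for m in re.finditer(r"(?s)" + pat, html)]

# 3. Pass the flag as an argument (assumes re.finditer accepts flags).
via_argument = [m.group(1) for m in re.finditer(pat, html, re.DOTALL)]

print(without_flag)
print(via_compile)
```

    With DOTALL the captured text of the second cell keeps its trailing
    newline, so a caller may want to strip() each group.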

    Peter
    Peter Otten, Oct 27, 2004
    #2

  3. Erik Johnson

    Erik Johnson Guest

    Robert Brewer wrote:

    >Embed the flag(s) you desire in the regex itself. For example, to
    >include DOTALL, change r'<td.*?>(.*?)</td>' to r'(?s)<td.*?>(.*?)</td>'


    Ahhhh! :) Sorry, my bad. It's right there in the docs, but I missed it -
    haven't fully comprehended all of re yet. :)


    Peter Otten wrote:

    >r = re.compile(r'<td.*?>(.*?)</td>', re.DOTALL)
    >for match in r.finditer(html):
    >    print match.group(1)


    Good - perhaps a more obvious way to do it.

    So there are two good workarounds.
    Thank you both for your helpful replies! :)

    I am still left puzzled though, why this won't work:

    pat = r'<td.*?>([\n.]*?)</td>'
    for match in re.finditer(pat, html):
        print match.group(1)

    but this will:

    pat = r'<td.*?>([\w\W]*?)</td>'
    for match in re.finditer(pat, html):
        print match.group(1)

    Thanks,
    -ej
    Erik Johnson, Oct 27, 2004
    #3
  4. Peter Otten

    Peter Otten Guest

    Erik Johnson wrote:

    > I am still left puzzled though, why this won't work:
    >
    > pat = r'<td.*?>([\n.]*?)</td>'
    > for match in re.finditer(pat, html):
    >     print match.group(1)


    >>> re.findall(r"[.\n]", "\nx\n")
    ['\n', '\n']
    >>> re.findall(r"[.\n]", "\n.\n")
    ['\n', '.', '\n']

    It seems a dot inside [] means a dot rather than "any character".
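
    Applied to the table cell from the earlier posts, the same effect can be
    sketched as follows, assuming Python 3. The sample strings and variable
    names are illustrative.

```python
import re

# The multi-line cell from the original table.
cell = "<td>Data 2-2\n</td>"

# Inside [], the dot is a literal ".", so this set accepts only "." and "\n".
# "Data 2-2" contains other characters, so nothing matches.
literal_dot_set = re.findall(r"<td.*?>([\n.]*?)</td>", cell)

# \w and \W together accept every character, newline included.
any_char_set = re.findall(r"<td.*?>([\w\W]*?)</td>", cell)

# A cell made up only of dots and newlines is what the first pattern can match.
dots_only = re.findall(r"<td.*?>([\n.]*?)</td>", "<td>..\n.</td>")

print(literal_dot_set)
print(any_char_set)
print(dots_only)
```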

    Peter
    Peter Otten, Oct 27, 2004
    #4
  5. Erik Johnson

    Erik Johnson Guest

    "Peter Otten" wrote in message news:clp4pj$2n2$01$-online.com...

    > >>> re.findall(r"[.\n]", "\nx\n")
    > ['\n', '\n']
    > >>> re.findall(r"[.\n]", "\n.\n")
    > ['\n', '.', '\n']
    >
    > It seems a dot inside [] means a dot rather than "any character".


    DOH! That's right in the docs too. <blush>

    []
    Used to indicate a set of characters. Characters can be listed individually,
    or a range of characters can be indicated by giving two characters and
    separating them by a "-". Special characters are not active inside sets. For
    example, [akm$] will match any of the characters "a", "k", "m", or "$";
    [a-z] will match any lowercase letter, and [a-zA-Z0-9] matches any letter or
    digit. Character classes such as \w or \S (defined below) are also
    acceptable inside a range. If you want to include a "]" or a "-" inside a
    set, precede it with a backslash, or place it as the first character. The
    pattern []] will match ']', for example.
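
    The rules in that documentation excerpt can be checked with a few short
    calls. This is a minimal sketch, assuming Python 3. The test strings and
    variable names are made up for illustration.

```python
import re

# "$" is an ordinary character inside a set, not an end-of-string anchor.
dollar = re.findall(r"[akm$]", "make $5")

# A "-" between two characters forms a range.
lower = re.findall(r"[a-z]", "Ab-Cd")

# "]" placed first in the set is taken literally.
bracket = re.findall(r"[]]", "x[0]")

# A backslash-escaped "-" is a literal hyphen rather than a range operator.
hyphen = re.findall(r"[a\-z]", "a-m-z")

# Class escapes such as \s stay active inside a set.
space_or_digit = re.findall(r"[\s0-9]", "a 1\tb")

print(dollar, lower, bracket, hyphen, space_or_digit)
```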


    Like I said, I'm learning. Nothing like experience! :)
    Thanks for your help!

    -ej
    Erik Johnson, Oct 27, 2004
    #5