Regular expression help

Discussion in 'Python' started by David Lees, Jul 17, 2003.

  1. David Lees

    David Lees Guest

    I forget how to find multiple instances of stuff between tags using
    regular expressions. Specifically I want to find all the text between a
    series of begin/end pairs in a multiline file.

    I tried:
    and got everything between the first begin and last end. I guess
    because of a greedy match. What I want to do is a list where each
    element is the text between another begin/end pair.

    TIA

    David Lees
     
    David Lees, Jul 17, 2003
    #1
    1. Advertisements

  2. people will tell you to use non-greedy matches, but that's often a
    bad idea in cases like this: the RE engine has to store lots of back-
    tracking information, and your program will consume a lot more
    memory than it has to (and may run out of stack and/or memory).

    a better approach is to do two searches: first search for a "begin",
    and once you've found that, look for an "end"

    import re

    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while 1:
    m = START.search(text, pos)
    if not m:
    break
    start = m.end()
    m = END.search(text, start)
    if not m:
    break
    end = m.start()
    process(text[start:end])
    pos = m.end() # move forward

    at this point, it's also obvious that you don't really have to use
    regular expressions:

    pos = 0

    while 1:
    start = text.find("begin", pos)
    if start < 0:
    break
    start += 5
    end = text.find("end", start)
    if end < 0:
    break
    process(text[start:end])
    pos = end # move forward

    </F>

    <!-- (the eff-bot guide to) the python standard library (redux):
    http://effbot.org/zone/librarybook-index.htm
    -->
     
    Fredrik Lundh, Jul 17, 2003
    #2
    1. Advertisements

  3. You were close. For non-greedy add the question mark after the greedy expression:
    ... begin first end
    ... begin
    ... second
    ... end
    ... begin problem begin nested end end
    ... begin last end
    ... """ [' first ', '\nsecond\n', ' problem begin nested ', ' last ']

    Notice what happened with the nested begin-ends. If you have nesting, you
    will need more than a simple regex approach.

    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 17, 2003
    #3
  4. David Lees

    yaipa h. Guest

    Fredrik,

    Not sure about the original poster, but I can use that. Thanks!

    --Alan

     
    yaipa h., Jul 17, 2003
    #4
  5. would you say so for this case? Or how like this case?
    For the above case, wouldn't the regex compile to a state machine
    that just has a few states to recognize e out of .* and then revert to .*
    if the next is not n, and if it is, then look for d similarly, and if not,
    revert to .*, etc or finish? For a short terminating match, it would seem
    relatively cheap?
    Or breaking your loop with an exception instead of tests:
    ... sdfsdf
    ... begin s2 end
    ... """
    ... end = 0 # end of previous search
    ... while 1:
    ... start = text.index("begin", end) + 5
    ... end = text.index("end", start)
    ... process(text[start:end])
    ... except ValueError:
    ... pass
    ...
    processing(' s1 ')
    processing(' s2 ')

    Or if you're guaranteed that every begin has an end, you could also write
    ... process(begxxx.split('end')[0])
    ...
    processing(' s1 ')
    processing(' s2 ')


    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 17, 2003
    #5
  6. David Lees

    David Lees Guest

    Actually this fails with the multi-line type of file I was asking about.
    [' bar ']
     
    David Lees, Jul 18, 2003
    #6
  7. It works if you include the DOTALL flag (?s) at the beginning, which makes
    .. also match \n: (BTW, (?si) would make it case-insensitive).
    [' foo\nmumble ', ' bar ']

    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 18, 2003
    #7
  8. David Lees

    David Lees Guest

    I just tried to benchmark both Fredrik's suggestions along with Bengt's
    using the same input file. The results (looping 200 times over the 400k
    file) are:
    Fredrik, regex = 1.74003930667
    Fredrik, no regex = 0.434207978947
    Bengt, regex = 1.45420158149

    Interesting how much faster the non-regex approach is.

    Thanks again.

    David Lees

    The code (which I have not carefully checked) is:

    import re, time

    def timeBengt(s,N):
    p = 'begin msc(.*?)end msc'
    rx =re.compile(p,re.DOTALL)
    t0 = time.clock()
    for i in xrange(N):
    x = x = rx.findall(s)
    t1 = time.clock()
    return t1-t0

    def timeFredrik1(text,N):
    t0 = time.clock()
    for i in xrange(N):
    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while 1:
    m = START.search(text, pos)
    if not m:
    break
    start = m.end()
    m = END.search(text, start)
    if not m:
    break
    end = m.start()
    pass
    pos = m.end() # move forward
    t1 = time.clock()
    return t1-t0


    def timeFredrik(text,N):
    t0 = time.clock()
    for i in xrange(N):
    pos = 0
    while 1:
    start = text.find("begin msc", pos)
    if start < 0:
    break
    start += 9
    end = text.find("end msc", start)
    if end < 0:
    break
    pass
    pos = end # move forward

    t1 = time.clock()
    return t1-t0

    fh = open('scu.cfg','rb')
    s = fh.read()
    fh.close()

    N = 200
    print 'Fredrik, regex = ',timeFredrik1(s,N)
    print 'Fredrik, no regex = ',timeFredrik(s,N)
    print 'Bengt, regex = ',timeBengt(s,N)
     
    David Lees, Jul 18, 2003
    #8
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.