Regular expression help

Discussion in 'Python' started by David Lees, Jul 17, 2003.

  1. David Lees

    David Lees Guest

    I forget how to find multiple instances of stuff between tags using
    regular expressions. Specifically I want to find all the text between a
    series of begin/end pairs in a multiline file.

    I tried:
    >>> p = 'begin(.*)end'
    >>> m = re.search(p,s,re.DOTALL)


    and got everything between the first begin and last end. I guess
    because of a greedy match. What I want to do is a list where each
    element is the text between another begin/end pair.

    TIA

    David Lees
     
    David Lees, Jul 17, 2003
    #1
    1. Advertising

  2. David Lees wrote:

    > I forget how to find multiple instances of stuff between tags using
    > regular expressions. Specifically I want to find all the text between a
    > series of begin/end pairs in a multiline file.
    >
    > I tried:
    > >>> p = 'begin(.*)end'
    > >>> m = re.search(p,s,re.DOTALL)

    >
    > and got everything between the first begin and last end. I guess
    > because of a greedy match. What I want to do is a list where each
    > element is the text between another begin/end pair.


    people will tell you to use non-greedy matches, but that's often a
    bad idea in cases like this: the RE engine has to store lots of back-
    tracking information, and your program will consume a lot more
    memory than it has to (and may run out of stack and/or memory).

    a better approach is to do two searches: first search for a "begin",
    and once you've found that, look for an "end"

    import re

    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while 1:
    m = START.search(text, pos)
    if not m:
    break
    start = m.end()
    m = END.search(text, start)
    if not m:
    break
    end = m.start()
    process(text[start:end])
    pos = m.end() # move forward

    at this point, it's also obvious that you don't really have to use
    regular expressions:

    pos = 0

    while 1:
    start = text.find("begin", pos)
    if start < 0:
    break
    start += 5
    end = text.find("end", start)
    if end < 0:
    break
    process(text[start:end])
    pos = end # move forward

    </F>

    <!-- (the eff-bot guide to) the python standard library (redux):
    http://effbot.org/zone/librarybook-index.htm
    -->
     
    Fredrik Lundh, Jul 17, 2003
    #2
    1. Advertising

  3. On Thu, 17 Jul 2003 04:27:23 GMT, David Lees <> wrote:

    >I forget how to find multiple instances of stuff between tags using
    >regular expressions. Specifically I want to find all the text between a
    >series of begin/end pairs in a multiline file.
    >
    >I tried:
    > >>> p = 'begin(.*)end'
    > >>> m = re.search(p,s,re.DOTALL)

    >
    >and got everything between the first begin and last end. I guess
    >because of a greedy match. What I want to do is a list where each
    >element is the text between another begin/end pair.
    >

    You were close. For non-greedy add the question mark after the greedy expression:

    >>> import re
    >>> s = """

    ... begin first end
    ... begin
    ... second
    ... end
    ... begin problem begin nested end end
    ... begin last end
    ... """
    >>> p = 'begin(.*?)end'
    >>> rx =re.compile(p,re.DOTALL)
    >>> rx.findall(s)

    [' first ', '\nsecond\n', ' problem begin nested ', ' last ']

    Notice what happened with the nested begin-ends. If you have nesting, you
    will need more than a simple regex approach.

    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 17, 2003
    #3
  4. David Lees

    yaipa h. Guest

    Fredrik,

    Not sure about the original poster, but I can use that. Thanks!

    --Alan

    "Fredrik Lundh" <> wrote in message news:<>...
    > David Lees wrote:
    >
    > > I forget how to find multiple instances of stuff between tags using
    > > regular expressions. Specifically I want to find all the text between a
    > > series of begin/end pairs in a multiline file.
    > >
    > > I tried:
    > > >>> p = 'begin(.*)end'
    > > >>> m = re.search(p,s,re.DOTALL)

    > >
    > > and got everything between the first begin and last end. I guess
    > > because of a greedy match. What I want to do is a list where each
    > > element is the text between another begin/end pair.

    >
    > people will tell you to use non-greedy matches, but that's often a
    > bad idea in cases like this: the RE engine has to store lots of back-
    > tracking information, and your program will consume a lot more
    > memory than it has to (and may run out of stack and/or memory).
    >
    > a better approach is to do two searches: first search for a "begin",
    > and once you've found that, look for an "end"
    >
    > import re
    >
    > pos = 0
    >
    > START = re.compile("begin")
    > END = re.compile("end")
    >
    > while 1:
    > m = START.search(text, pos)
    > if not m:
    > break
    > start = m.end()
    > m = END.search(text, start)
    > if not m:
    > break
    > end = m.start()
    > process(text[start:end])
    > pos = m.end() # move forward
    >
    > at this point, it's also obvious that you don't really have to use
    > regular expressions:
    >
    > pos = 0
    >
    > while 1:
    > start = text.find("begin", pos)
    > if start < 0:
    > break
    > start += 5
    > end = text.find("end", start)
    > if end < 0:
    > break
    > process(text[start:end])
    > pos = end # move forward
    >
    > </F>
    >
    > <!-- (the eff-bot guide to) the python standard library (redux):
    > http://effbot.org/zone/librarybook-index.htm
    > -->
     
    yaipa h., Jul 17, 2003
    #4
  5. On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <> wrote:

    >David Lees wrote:
    >
    >> I forget how to find multiple instances of stuff between tags using
    >> regular expressions. Specifically I want to find all the text between a
    >> series of begin/end pairs in a multiline file.
    >>
    >> I tried:
    >> >>> p = 'begin(.*)end'
    >> >>> m = re.search(p,s,re.DOTALL)

    >>
    >> and got everything between the first begin and last end. I guess
    >> because of a greedy match. What I want to do is a list where each
    >> element is the text between another begin/end pair.

    >
    >people will tell you to use non-greedy matches, but that's often a
    >bad idea in cases like this: the RE engine has to store lots of back-

    would you say so for this case? Or how like this case?

    >tracking information, and your program will consume a lot more
    >memory than it has to (and may run out of stack and/or memory).

    For the above case, wouldn't the regex compile to a state machine
    that just has a few states to recognize e out of .* and then revert to .*
    if the next is not n, and if it is, then look for d similarly, and if not,
    revert to .*, etc or finish? For a short terminating match, it would seem
    relatively cheap?

    >at this point, it's also obvious that you don't really have to use
    >regular expressions:
    >
    > pos = 0
    >
    > while 1:
    > start = text.find("begin", pos)
    > if start < 0:
    > break
    > start += 5
    > end = text.find("end", start)
    > if end < 0:
    > break
    > process(text[start:end])
    > pos = end # move forward
    >
    ></F>


    Or breaking your loop with an exception instead of tests:

    >>> text = """begin s1 end

    ... sdfsdf
    ... begin s2 end
    ... """

    >>> def process(s): print 'processing(%r)'%s

    ...
    >>> try:

    ... end = 0 # end of previous search
    ... while 1:
    ... start = text.index("begin", end) + 5
    ... end = text.index("end", start)
    ... process(text[start:end])
    ... except ValueError:
    ... pass
    ...
    processing(' s1 ')
    processing(' s2 ')

    Or if you're guaranteed that every begin has an end, you could also write

    >>> for begxxx in text.split('begin')[1:]:

    ... process(begxxx.split('end')[0])
    ...
    processing(' s1 ')
    processing(' s2 ')


    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 17, 2003
    #5
  6. David Lees

    David Lees Guest

    Andrew Bennetts wrote:
    > On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
    >
    >>I forget how to find multiple instances of stuff between tags using
    >>regular expressions. Specifically I want to find all the text between a

    >
    > ^^^^^^^^
    >
    > How about re.findall?
    >
    > E.g.:
    >
    > >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')

    > [' foo ', ' bar ']
    >
    > -Andrew.
    >
    >


    Actually this fails with the multi-line type of file I was asking about.

    >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')

    [' bar ']
     
    David Lees, Jul 18, 2003
    #6
  7. On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <> wrote:

    >Andrew Bennetts wrote:
    >> On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
    >>
    >>>I forget how to find multiple instances of stuff between tags using
    >>>regular expressions. Specifically I want to find all the text between a

    >>
    >> ^^^^^^^^
    >>
    >> How about re.findall?
    >>
    >> E.g.:
    >>
    >> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')

    >> [' foo ', ' bar ']
    >>
    >> -Andrew.
    >>
    >>

    >
    >Actually this fails with the multi-line type of file I was asking about.
    >
    > >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')

    >[' bar ']
    >

    It works if you include the DOTALL flag (?s) at the beginning, which makes
    .. also match \n: (BTW, (?si) would make it case-insensitive).

    >>> import re
    >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')

    [' foo\nmumble ', ' bar ']

    Regards,
    Bengt Richter
     
    Bengt Richter, Jul 18, 2003
    #7
  8. David Lees

    David Lees Guest

    Bengt Richter wrote:

    > On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <> wrote:
    >
    >
    >>Andrew Bennetts wrote:
    >>
    >>>On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
    >>>
    >>>
    >>>>I forget how to find multiple instances of stuff between tags using
    >>>>regular expressions. Specifically I want to find all the text between a
    >>>
    >>> ^^^^^^^^
    >>>
    >>>How about re.findall?
    >>>
    >>>E.g.:
    >>>
    >>> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
    >>> [' foo ', ' bar ']
    >>>
    >>>-Andrew.
    >>>
    >>>

    >>
    >>Actually this fails with the multi-line type of file I was asking about.
    >>
    >>
    >>>>>re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')

    >>
    >>[' bar ']
    >>

    >
    > It works if you include the DOTALL flag (?s) at the beginning, which makes
    > . also match \n: (BTW, (?si) would make it case-insensitive).
    >
    > >>> import re
    > >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')

    > [' foo\nmumble ', ' bar ']
    >
    > Regards,
    > Bengt Richter

    I just tried to benchmark both Fredrik's suggestions along with Bengt's
    using the same input file. The results (looping 200 times over the 400k
    file) are:
    Fredrik, regex = 1.74003930667
    Fredrik, no regex = 0.434207978947
    Bengt, regex = 1.45420158149

    Interesting how much faster the non-regex approach is.

    Thanks again.

    David Lees

    The code (which I have not carefully checked) is:

    import re, time

    def timeBengt(s,N):
    p = 'begin msc(.*?)end msc'
    rx =re.compile(p,re.DOTALL)
    t0 = time.clock()
    for i in xrange(N):
    x = x = rx.findall(s)
    t1 = time.clock()
    return t1-t0

    def timeFredrik1(text,N):
    t0 = time.clock()
    for i in xrange(N):
    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while 1:
    m = START.search(text, pos)
    if not m:
    break
    start = m.end()
    m = END.search(text, start)
    if not m:
    break
    end = m.start()
    pass
    pos = m.end() # move forward
    t1 = time.clock()
    return t1-t0


    def timeFredrik(text,N):
    t0 = time.clock()
    for i in xrange(N):
    pos = 0
    while 1:
    start = text.find("begin msc", pos)
    if start < 0:
    break
    start += 9
    end = text.find("end msc", start)
    if end < 0:
    break
    pass
    pos = end # move forward

    t1 = time.clock()
    return t1-t0

    fh = open('scu.cfg','rb')
    s = fh.read()
    fh.close()

    N = 200
    print 'Fredrik, regex = ',timeFredrik1(s,N)
    print 'Fredrik, no regex = ',timeFredrik(s,N)
    print 'Bengt, regex = ',timeBengt(s,N)
     
    David Lees, Jul 18, 2003
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Anand

    Regular Expression help...

    Anand, Jul 9, 2003, in forum: Perl
    Replies:
    1
    Views:
    1,246
    Eric J. Roode
    Jul 9, 2003
  2. Eric B.
    Replies:
    1
    Views:
    440
    Jim Gibson
    Dec 17, 2004
  3. VSK
    Replies:
    2
    Views:
    2,388
  4. =?iso-8859-1?B?bW9vcJk=?=

    Matching abitrary expression in a regular expression

    =?iso-8859-1?B?bW9vcJk=?=, Dec 1, 2005, in forum: Java
    Replies:
    8
    Views:
    884
    Alan Moore
    Dec 2, 2005
  5. GIMME
    Replies:
    3
    Views:
    12,049
    vforvikash
    Dec 29, 2008
Loading...

Share This Page