Regular expression help

David Lees · Jul 17, 2003

I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:
and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

TIA

David Lees

Fredrik Lundh · Jul 17, 2003

David said:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:

and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"

import re

pos = 0

START = re.compile("begin")
END = re.compile("end")

while 1:
    m = START.search(text, pos)
    if not m:
        break
    start = m.end()
    m = END.search(text, start)
    if not m:
        break
    end = m.start()
    process(text[start:end])
    pos = m.end() # move forward

at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
    start = text.find("begin", pos)
    if start < 0:
        break
    start += 5
    end = text.find("end", start)
    if end < 0:
        break
    process(text[start:end])
    pos = end # move forward

</F>
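
Since the original question asked for a list, the find-based loop above can be wrapped in a generator. This is only a minimal sketch, assuming Python 3; `iter_between` and its `begin`/`end` parameters are illustrative names that do not appear in the thread.

```python
def iter_between(text, begin="begin", end="end"):
    """Yield the text between each begin/end pair, scanning left to right."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            return                   # no more opening markers
        start += len(begin)          # skip past the opening marker
        stop = text.find(end, start)
        if stop < 0:
            return                   # unmatched begin: stop quietly
        yield text[start:stop]
        pos = stop + len(end)        # resume after the closing marker


sample = "begin first end\nbegin\nsecond\nend\nbegin last end\n"
chunks = list(iter_between(sample))
print(chunks)  # [' first ', '\nsecond\n', ' last ']
```

Using `len(begin)` instead of a hard-coded 5 keeps the offset correct if the markers change.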

Bengt Richter · Jul 17, 2003

David said:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:

and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

You were close. For non-greedy add the question mark after the greedy expression:
... begin first end
... begin
... second
... end
... begin problem begin nested end end
... begin last end
... """
[' first ', '\nsecond\n', ' problem begin nested ', ' last ']

Notice what happened with the nested begin-ends. If you have nesting, you
will need more than a simple regex approach.

Regards,
Bengt Richter
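
One possible way to handle the nested case is to scan for both markers and keep a depth counter, so that only the outermost pairs are reported. The sketch below assumes Python 3; `TOKEN` and `outermost_blocks` are made-up names, and the markers are matched as bare substrings, so an "end" inside a longer word would also count.

```python
import re

TOKEN = re.compile(r"begin|end")


def outermost_blocks(text):
    """Return the body of each outermost begin/end block, nested pairs included."""
    blocks = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "begin":
            if depth == 0:
                start = m.end()      # body starts after the outermost "begin"
            depth += 1
        elif depth > 0:              # ignore a stray "end" with no open "begin"
            depth -= 1
            if depth == 0:
                blocks.append(text[start:m.start()])
    return blocks


sample = """
begin first end
begin
second
end
begin problem begin nested end end
begin last end
"""
blocks = outermost_blocks(sample)
print(blocks)
# [' first ', '\nsecond\n', ' problem begin nested end ', ' last ']
```

Compared with the non-greedy result above, the third element now runs to the matching outer "end" instead of stopping at the first one.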

yaipa h. · Jul 17, 2003

Fredrik,

Not sure about the original poster, but I can use that. Thanks!

--Alan

Bengt Richter · Jul 17, 2003

Fredrik Lundh said:
people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of
backtracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

Would you say so for this case? Or how much like this case would it
have to be?

For the above case, wouldn't the regex compile to a state machine
that just has a few states: recognize "e" out of .*, revert to .* if
the next character is not "n", and if it is, look for "d" similarly,
and so on until it either reverts to .* or finishes? For a short
terminating match, that would seem relatively cheap.

Fredrik Lundh said:
at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
    start = text.find("begin", pos)
    if start < 0:
        break
    start += 5
    end = text.find("end", start)
    if end < 0:
        break
    process(text[start:end])
    pos = end # move forward

</F>

Or breaking your loop with an exception instead of tests:
... sdfsdf
... begin s2 end
... """
>>> try:
...     end = 0 # end of previous search
...     while 1:
...         start = text.index("begin", end) + 5
...         end = text.index("end", start)
...         process(text[start:end])
... except ValueError:
...     pass
...
processing(' s1 ')
processing(' s2 ')

Or if you're guaranteed that every begin has an end, you could also write

>>> for begxxx in text.split('begin')[1:]:

Click to expand...

Click to expand...

... process(begxxx.split('end')[0])
...
processing(' s1 ')
processing(' s2 ')

Regards,
Bengt Richter
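
The two variants above can be written as functions that return a list. This is a rough sketch assuming Python 3; `between_index`, `between_split` and the sample text are invented for illustration, and the split version is only safe when every begin has a matching end.

```python
def between_index(text, begin="begin", end="end"):
    """Collect chunks with str.index; the ValueError ends the scan."""
    chunks = []
    stop = 0  # end of previous search
    try:
        while True:
            start = text.index(begin, stop) + len(begin)
            stop = text.index(end, start)
            chunks.append(text[start:stop])
    except ValueError:
        pass  # no further begin (or no closing end) was found
    return chunks


def between_split(text, begin="begin", end="end"):
    """Collect chunks with str.split; assumes every begin has an end."""
    return [part.split(end, 1)[0] for part in text.split(begin)[1:]]


text = "begin s1 end\nsdfsdf\nbegin s2 end\n"
by_index = between_index(text)
by_split = between_split(text)
print(by_index)  # [' s1 ', ' s2 ']
print(by_split)  # [' s1 ', ' s2 ']
```

The two differ on unbalanced input: with a trailing begin that has no end, `between_index` drops it, while `between_split` returns everything after it.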

David Lees · Jul 18, 2003

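Andrew said:
On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a

^^^^^^^^

How about re.findall?

E.g.:

>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
[' foo ', ' bar ']

-Andrew.

Actually this fails with the multi-line type of file I was asking about.

>>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' bar ']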

Bengt Richter · Jul 18, 2003

It works if you include the DOTALL flag (?s) at the beginning, which makes
"." also match "\n" (BTW, (?si) would also make it case-insensitive):

>>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' foo\nmumble ', ' bar ']

Regards,
Bengt Richter
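
The inline flag can also be passed as an argument. A small sketch, assuming Python 3 and reusing the sample string from the thread, that compares the spellings side by side (the variable names are just for illustration):

```python
import re

text = "BEGIN foo\nmumble END BEGIN bar END"

# Without DOTALL, "." stops at the newline, so the first pair cannot match.
plain = re.findall(r"BEGIN(.*?)END", text)

# The flags argument is equivalent to the inline (?s) prefix.
dotall = re.findall(r"BEGIN(.*?)END", text, re.DOTALL)

# (?si) adds IGNORECASE, so a lowercase pattern matches the uppercase markers.
dotall_nocase = re.findall(r"(?si)begin(.*?)end", text)

print(plain)          # [' bar ']
print(dotall)         # [' foo\nmumble ', ' bar ']
print(dotall_nocase)  # [' foo\nmumble ', ' bar ']
```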

David Lees · Jul 18, 2003

Bengt said:
Andrew said:

On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:

I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a

^^^^^^^^

How about re.findall?

E.g.:

>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
[' foo ', ' bar ']

-Andrew.

Actually this fails with the multi-line type of file I was asking about.

>>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' bar ']

It works if you include the DOTALL flag (?s) at the beginning, which makes
"." also match "\n" (BTW, (?si) would also make it case-insensitive):

>>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' foo\nmumble ', ' bar ']

Regards,
Bengt Richter

I just tried to benchmark both of Fredrik's suggestions along with Bengt's
using the same input file. The results, in seconds (looping 200 times over
the 400k file), are:
Fredrik, regex = 1.74003930667
Fredrik, no regex = 0.434207978947
Bengt, regex = 1.45420158149

Interesting how much faster the non-regex approach is.

Thanks again.

David Lees

The code (which I have not carefully checked) is:

import re, time

def timeBengt(s,N):
    p = 'begin msc(.*?)end msc'
    rx = re.compile(p,re.DOTALL)
    t0 = time.clock()
    for i in xrange(N):
        x = rx.findall(s)
    t1 = time.clock()
    return t1-t0

def timeFredrik1(text,N):
    t0 = time.clock()
    for i in xrange(N):
        pos = 0

        # note: matches bare "begin"/"end", not "begin msc"/"end msc"
        # as the other two functions do
        START = re.compile("begin")
        END = re.compile("end")

        while 1:
            m = START.search(text, pos)
            if not m:
                break
            start = m.end()
            m = END.search(text, start)
            if not m:
                break
            end = m.start()
            pass
            pos = m.end() # move forward
    t1 = time.clock()
    return t1-t0

def timeFredrik(text,N):
    t0 = time.clock()
    for i in xrange(N):
        pos = 0
        while 1:
            start = text.find("begin msc", pos)
            if start < 0:
                break
            start += 9
            end = text.find("end msc", start)
            if end < 0:
                break
            pass
            pos = end # move forward

    t1 = time.clock()
    return t1-t0

fh = open('scu.cfg','rb')
s = fh.read()
fh.close()

N = 200
print 'Fredrik, regex = ',timeFredrik1(s,N)
print 'Fredrik, no regex = ',timeFredrik(s,N)
print 'Bengt, regex = ',timeBengt(s,N)
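
For anyone repeating the comparison today, here is a hedged sketch of the same three approaches for Python 3, timed with `timeit`. The function names and the synthetic sample text are assumptions (the original scu.cfg file is not available), all three functions use the same "begin msc"/"end msc" markers, and the timings will differ from the figures quoted above.

```python
import re
import timeit

BEGIN, END = "begin msc", "end msc"
PATTERN = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)
START_RX = re.compile(re.escape(BEGIN))
END_RX = re.compile(re.escape(END))


def with_findall(text):
    """Single non-greedy regex with DOTALL."""
    return PATTERN.findall(text)


def with_two_searches(text):
    """Two regex searches per pair: one for the start, one for the end."""
    out, pos = [], 0
    while True:
        m = START_RX.search(text, pos)
        if not m:
            break
        start = m.end()
        m = END_RX.search(text, start)
        if not m:
            break
        out.append(text[start:m.start()])
        pos = m.end()
    return out


def with_find(text):
    """Plain str.find, no regular expressions."""
    out, pos = [], 0
    while True:
        start = text.find(BEGIN, pos)
        if start < 0:
            break
        start += len(BEGIN)
        end = text.find(END, start)
        if end < 0:
            break
        out.append(text[start:end])
        pos = end + len(END)
    return out


# Synthetic stand-in for the input file used in the thread.
sample = "header line\nbegin msc\n  key = value\nend msc\n" * 2000

# Check that all three approaches agree before timing them.
results = [f(sample) for f in (with_findall, with_two_searches, with_find)]
assert results[0] == results[1] == results[2]

for f in (with_findall, with_two_searches, with_find):
    seconds = timeit.timeit(lambda: f(sample), number=20)
    print(f"{f.__name__}: {seconds:.4f} s")
```

Unlike the timing loops above, each function builds its result list, so the three are doing comparable work.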
