Regular expression help

D

David Lees

I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:
and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

TIA

David Lees
 
F

Fredrik Lundh

David said:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:

and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"

import re

pos = 0

START = re.compile("begin")
END = re.compile("end")

while 1:
m = START.search(text, pos)
if not m:
break
start = m.end()
m = END.search(text, start)
if not m:
break
end = m.start()
process(text[start:end])
pos = m.end() # move forward

at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
start = text.find("begin", pos)
if start < 0:
break
start += 5
end = text.find("end", start)
if end < 0:
break
process(text[start:end])
pos = end # move forward

</F>

<!-- (the eff-bot guide to) the python standard library (redux):
http://effbot.org/zone/librarybook-index.htm
-->
 
B

Bengt Richter

I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:

and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.
You were close. For non-greedy add the question mark after the greedy expression:
... begin first end
... begin
... second
... end
... begin problem begin nested end end
... begin last end
... """ [' first ', '\nsecond\n', ' problem begin nested ', ' last ']

Notice what happened with the nested begin-ends. If you have nesting, you
will need more than a simple regex approach.

Regards,
Bengt Richter
 
Y

yaipa h.

Fredrik,

Not sure about the original poster, but I can use that. Thanks!

--Alan

Fredrik Lundh said:
David said:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:

and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"

import re

pos = 0

START = re.compile("begin")
END = re.compile("end")

while 1:
m = START.search(text, pos)
if not m:
break
start = m.end()
m = END.search(text, start)
if not m:
break
end = m.start()
process(text[start:end])
pos = m.end() # move forward

at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
start = text.find("begin", pos)
if start < 0:
break
start += 5
end = text.find("end", start)
if end < 0:
break
process(text[start:end])
pos = end # move forward

</F>

<!-- (the eff-bot guide to) the python standard library (redux):
http://effbot.org/zone/librarybook-index.htm
-->
 
B

Bengt Richter

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
would you say so for this case? Or how like this case?
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).
For the above case, wouldn't the regex compile to a state machine
that just has a few states to recognize e out of .* and then revert to .*
if the next is not n, and if it is, then look for d similarly, and if not,
revert to .*, etc or finish? For a short terminating match, it would seem
relatively cheap?
at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
start = text.find("begin", pos)
if start < 0:
break
start += 5
end = text.find("end", start)
if end < 0:
break
process(text[start:end])
pos = end # move forward

</F>

Or breaking your loop with an exception instead of tests:
... sdfsdf
... begin s2 end
... """
... end = 0 # end of previous search
... while 1:
... start = text.index("begin", end) + 5
... end = text.index("end", start)
... process(text[start:end])
... except ValueError:
... pass
...
processing(' s1 ')
processing(' s2 ')

Or if you're guaranteed that every begin has an end, you could also write
>>> for begxxx in text.split('begin')[1:]:
... process(begxxx.split('end')[0])
...
processing(' s1 ')
processing(' s2 ')


Regards,
Bengt Richter
 
D

David Lees

Andrew said:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a

^^^^^^^^

How about re.findall?

E.g.:
[' foo ', ' bar ']

-Andrew.

Actually this fails with the multi-line type of file I was asking about.
[' bar ']
 
B

Bengt Richter

Andrew said:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a

^^^^^^^^

How about re.findall?

E.g.:
re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
[' foo ', ' bar ']

-Andrew.

Actually this fails with the multi-line type of file I was asking about.
[' bar ']
It works if you include the DOTALL flag (?s) at the beginning, which makes
.. also match \n: (BTW, (?si) would make it case-insensitive).
[' foo\nmumble ', ' bar ']

Regards,
Bengt Richter
 
D

David Lees

Bengt said:
Andrew said:
On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:


I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a

^^^^^^^^

How about re.findall?

E.g.:

re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
[' foo ', ' bar ']

-Andrew.

Actually this fails with the multi-line type of file I was asking about.

re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')

[' bar ']

It works if you include the DOTALL flag (?s) at the beginning, which makes
. also match \n: (BTW, (?si) would make it case-insensitive).
[' foo\nmumble ', ' bar ']

Regards,
Bengt Richter
I just tried to benchmark both Fredrik's suggestions along with Bengt's
using the same input file. The results (looping 200 times over the 400k
file) are:
Fredrik, regex = 1.74003930667
Fredrik, no regex = 0.434207978947
Bengt, regex = 1.45420158149

Interesting how much faster the non-regex approach is.

Thanks again.

David Lees

The code (which I have not carefully checked) is:

import re, time

def timeBengt(s,N):
p = 'begin msc(.*?)end msc'
rx =re.compile(p,re.DOTALL)
t0 = time.clock()
for i in xrange(N):
x = x = rx.findall(s)
t1 = time.clock()
return t1-t0

def timeFredrik1(text,N):
t0 = time.clock()
for i in xrange(N):
pos = 0

START = re.compile("begin")
END = re.compile("end")

while 1:
m = START.search(text, pos)
if not m:
break
start = m.end()
m = END.search(text, start)
if not m:
break
end = m.start()
pass
pos = m.end() # move forward
t1 = time.clock()
return t1-t0


def timeFredrik(text,N):
t0 = time.clock()
for i in xrange(N):
pos = 0
while 1:
start = text.find("begin msc", pos)
if start < 0:
break
start += 9
end = text.find("end msc", start)
if end < 0:
break
pass
pos = end # move forward

t1 = time.clock()
return t1-t0

fh = open('scu.cfg','rb')
s = fh.read()
fh.close()

N = 200
print 'Fredrik, regex = ',timeFredrik1(s,N)
print 'Fredrik, no regex = ',timeFredrik(s,N)
print 'Bengt, regex = ',timeBengt(s,N)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,534
Members
45,008
Latest member
Rahul737

Latest Threads

Top