Awkward code using multiple regexp searches

Topher Cawlfield

Hi,

I'm relatively new to Python, and I already love it even after several
years of writing Perl. But a few times already I've found myself
writing the following bit of awkward code when parsing text files. Can
anyone suggest a more elegant solution?

rexp1 = re.compile(r'blah(dee)blah')
rexp2 = re.compile(r'hum(dum)')
for line in inFile:
    reslt = rexp1.search(line)
    if reslt:
        something = reslt.group(1)
    else:
        reslt = rexp2.search(line)
        if reslt:
            somethingElse = reslt.group(1)

I'm getting more and more nested if statements, which gets ugly and very
hard to follow after the fourth or fifth regexp search.

Equivalent Perl code is more compact but more importantly seems to
communicate the process of searching for multiple regular expressions
more clearly:

while (<IN>) {
    if (/blah(dee)blah/) {
        $something = $1;
    } elsif (/hum(dum)/) {
        $somethingElse = $1;
    }
}

I'm a little bit worried about doing the following in Python, since I'm
not sure if the compiler is smart enough to avoid doing each regexp
search twice:

for line in inFile:
    if rexp1.search(line):
        something = rexp1.search(line).group(1)
    elif rexp2.search(line):
        somethingElse = rexp2.search(line).group(1)

In many cases I am worried about efficiency as these scripts parse a
couple GB of text!

Does anyone have a suggestion for cleaning up this commonplace Python
code construction?

Thanks,
Topher Cawlfield
 
Steven Bethard

Topher Cawlfield said:
> Can anyone suggest a more elegant solution?

Does this do what you want?
>>> rexp1 = re.compile(r'blah(dee)blah')
>>> rexp2 = re.compile(r'hum(dum)')
>>> for s in ['blahdeeblah', 'blah blah', 'humdum humdum']:
...     result = rexp1.findall(s) or rexp2.findall(s) or [None]
...     print repr(result[0])
...
'dee'
None
'dum'

The findall method returns all matches of the re in the string, or an empty
list if there were no matches. An empty list is false, so if the first findall
fails, the or expression goes on to evaluate the second findall, and if that
one fails too, the default value [None] is used. Note that findall returns a
list of the matches, which is why I have to extract the first element of the
list at the end.
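
The same idea wrapped in a function, as a minimal sketch. The name first_capture is only for illustration. Two things to keep in mind: findall scans the whole string for every match even though only the first is used, and the result does not tell you which pattern matched.

```python
import re

rexp1 = re.compile(r'blah(dee)blah')
rexp2 = re.compile(r'hum(dum)')

def first_capture(line):
    """Return group 1 of the first pattern that matches, or None.

    Hypothetical helper: relies on an empty findall() result being false,
    so the `or` chain falls through to the next pattern.
    """
    result = rexp1.findall(line) or rexp2.findall(line) or [None]
    return result[0]

for s in ['blahdeeblah', 'blah blah', 'humdum humdum']:
    print(repr(first_capture(s)))
```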

Topher Cawlfield said:
> I'm a little bit worried about doing the following in Python, since I'm
> not sure if the compiler is smart enough to avoid doing each regexp
> search twice:
>
> for line in inFile:
>     if rexp1.search(line):
>         something = rexp1.search(line).group(1)
>     elif rexp2.search(line):
>         somethingElse = rexp2.search(line).group(1)

You're right here - Python will call the method twice (and therefore search
the string twice). It has no way of knowing that these two calls to the same
method will actually return the same results. (In general, there are no
guarantees that calling a method with the same parameters will return the same
result -- for example, file.read(100))
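
If a newer interpreter is available, the double call can be avoided without nesting. This is a sketch that assumes Python 3.8 or later, where assignment expressions (the := operator) exist; the function name parse and its return value are made up for the example.

```python
import re

rexp1 = re.compile(r'blah(dee)blah')
rexp2 = re.compile(r'hum(dum)')

def parse(lines):
    """Scan lines once, running each regexp at most once per line."""
    something = something_else = None
    for line in lines:
        # `m := ...` stores the match object and tests it in the same step
        if m := rexp1.search(line):
            something = m.group(1)
        elif m := rexp2.search(line):
            something_else = m.group(1)
    return something, something_else

print(parse(['blahdeeblah', 'nothing here', 'humdum']))
```

Each search runs once per branch, and the if/elif chain stays flat no matter how many patterns are added.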

Steve
 
Andrew Dalke

Topher said:
> But a few times already I've found myself
> writing the following bit of awkward code when parsing text files. Can
> anyone suggest a more elegant solution?
>
> rexp1 = re.compile(r'blah(dee)blah')
> rexp2 = re.compile(r'hum(dum)')
> for line in inFile:
>     reslt = rexp1.search(line)
>     if reslt:
>         something = reslt.group(1)
>     else:
>         reslt = rexp2.search(line)
>         if reslt:
>             somethingElse = reslt.group(1)

I usually solve this particular case with a 'continue':

for line in inFile:
    reslt = rexp1.search(line)
    if reslt:
        something = reslt.group(1)
        continue
    reslt = rexp2.search(line)
    if reslt:
        somethingElse = reslt.group(1)
        continue

Still more cumbersome than the Perl equivalent.
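
When the list of patterns keeps growing, one way to keep the 'continue' style flat is to drive it from a table. A rough sketch: the PATTERNS table, the scan_table name and the dict of results are all assumptions for the example, not something from the original script.

```python
import re

# Hypothetical table: each compiled regexp is paired with the key to store under.
PATTERNS = [
    (re.compile(r'blah(dee)blah'), 'something'),
    (re.compile(r'hum(dum)'), 'somethingElse'),
]

def scan_table(lines):
    """Try the patterns in order on each line; the first one that matches wins."""
    found = {}
    for line in lines:
        for rexp, key in PATTERNS:
            m = rexp.search(line)
            if m:
                found[key] = m.group(1)
                break  # same effect as 'continue' in the hand-written loop
    return found

print(scan_table(['xx blahdeeblah', 'humdum', 'nope']))
```

Adding a fifth or sixth regexp then means adding one row to the table rather than another level of nesting.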

You could do a trick like this:


import re

class Match:
    def __init__(self, pattern, flags=0):
        self.pat = re.compile(pattern, flags)
        self.m = None
    def __call__(self, s):
        # remember the latest match so group() etc. can be used after the test
        self.m = self.pat.match(s)
        return bool(self.m)
    def __nonzero__(self):  # Python 2 truth hook; Python 3 calls it __bool__
        return bool(self.m)
    def group(self, x):
        return self.m.group(x)
    def start(self, x):
        return self.m.start(x)
    def end(self, x):
        return self.m.end(x)

pat1 = Match("A(.*)")
pat2 = Match("BA(.*)")
pat3 = Match("BB(.*)")

def test(s):
    if pat1(s): print "Looks like", pat1.group(1)
    elif pat2(s): print "no, it is", pat2.group(1)
    elif pat3(s): print "really?", pat3.group(1)
    else: print "Never mind."


This is much more along the lines of what you want,
but it conflates the idea of search object and
match object and makes your code more susceptible
to subtle breaks. Consider:

digits = Match(r"(\s*(\d+)\s*)")

def divisor(s):
    if s[:1] == "/":
        if digits(s[1:]):
            return int(digits.group(2))
        raise TypeError("nothing after the /")
    # no fraction, use 1 as the divisor
    return 1


def fraction(s):
    if digits(s):
        denom = divisor(s[digits.end(1):])
        # subtle break: divisor() may have re-run digits, so group(2)
        # here can be the denominator's match, not the numerator's
        return int(digits.group(2)), denom
    raise TypeError("does not start with a number")
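
To see the break concretely, here is a trimmed Python 3 port of the wrapper (an assumption on my part: __bool__ replaces __nonzero__, and the unused start method is dropped), with the fraction function in two versions. The names fraction_buggy and fraction_fixed are made up for the comparison.

```python
import re

class Match:
    """Stateful wrapper: remembers only the most recent match."""
    def __init__(self, pattern, flags=0):
        self.pat = re.compile(pattern, flags)
        self.m = None
    def __call__(self, s):
        self.m = self.pat.match(s)
        return bool(self.m)
    def __bool__(self):
        return bool(self.m)
    def group(self, x):
        return self.m.group(x)
    def end(self, x):
        return self.m.end(x)

digits = Match(r"(\s*(\d+)\s*)")

def divisor(s):
    if s[:1] == "/":
        if digits(s[1:]):  # overwrites the match stored in digits
            return int(digits.group(2))
        raise TypeError("nothing after the /")
    return 1  # no fraction, use 1 as the divisor

def fraction_buggy(s):
    if digits(s):
        denom = divisor(s[digits.end(1):])
        # digits now holds the denominator's match, so this reads the wrong number
        return int(digits.group(2)), denom
    raise TypeError("does not start with a number")

def fraction_fixed(s):
    if digits(s):
        # copy everything needed out of the shared object before the nested call
        numer, rest = int(digits.group(2)), s[digits.end(1):]
        return numer, divisor(rest)
    raise TypeError("does not start with a number")

print(fraction_buggy("3/4"), fraction_fixed("3/4"))
```

The buggy version only goes wrong when a "/" is present, because that is the only path on which divisor calls digits again.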


But as a Perl programmer you are perhaps used to this,
because Perl does the same conflation and so has
the same problems. (I think. It's been a while ...
Nope! The regexp search results appear to be "my"
variables now. When I started with perl4, all variables
were either global or "dynamically scoped"-ish with
local.)


> I'm a little bit worried about doing the following in Python, since I'm
> not sure if the compiler is smart enough to avoid doing each regexp
> search twice:
>
> for line in inFile:
>     if rexp1.search(line):
>         something = rexp1.search(line).group(1)
>     elif rexp2.search(line):
>         somethingElse = rexp2.search(line).group(1)
>
> In many cases I am worried about efficiency as these scripts parse a
> couple GB of text!

It isn't smart enough. To make it that smart would require
a lot more work. For example, how does it know that the
implementation of "rexp1.search(line)" always returns the
same value? Or even that "rexp1.search" returns the
same bound method?

Andrew
 
Jason Lai

Does it have to be stored in a different variable? If you have a list of
regexes and you want to see if any of them match, you could create a
compound regex such as "blah(dee)blah|hum(dum)" and search for that
(although you have to be careful about overlaps).
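
A sketch of the compound regex, assuming named groups are acceptable. The group names and the scan_combined function are invented for the example; m.lastgroup reports the name of the group that took part in the match, which tells you which alternative was hit.

```python
import re

# One alternation, one search per line; each alternative gets a named group.
combined = re.compile(r'blah(?P<something>dee)blah|hum(?P<somethingElse>dum)')

def scan_combined(lines):
    """Collect the last captured text seen for each named alternative."""
    found = {}
    for line in lines:
        m = combined.search(line)
        if m:
            found[m.lastgroup] = m.group(m.lastgroup)
    return found

print(scan_combined(['blahdeeblah', 'no match', 'humdum']))
```

The overlap caveat shows up when a line contains both patterns: search returns the leftmost match in the line, so 'humdum blahdeeblah' is reported as somethingElse, whereas the if/elif version would have tried blah(dee)blah first.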

- Jason Lai
 
