Awkward code using multiple regexp searches

Topher Cawlfield · Sep 10, 2004

Hi,

I'm relatively new to Python, and I already love it even after several
years of writing Perl. But a few times already I've found myself
writing the following bit of awkward code when parsing text files. Can
anyone suggest a more elegant solution?

rexp1 = re.compile(r'blah(dee)blah')
rexp2 = re.compile(r'hum(dum)')
for line in inFile:
    reslt = rexp1.search(line)
    if reslt:
        something = reslt.group(1)
    else:
        reslt = rexp2.search(line)
        if reslt:
            somethingElse = reslt.group(1)

I'm getting more and more nested if statements, which gets ugly and very
hard to follow after the fourth or fifth regexp search.

Equivalent Perl code is more compact but more importantly seems to
communicate the process of searching for multiple regular expressions
more clearly:

while (<IN>) {
    if (/blah(dee)blah/) {
        $something = $1;
    } elsif (/hum(dum)/) {
        $somethingElse = $1;
    }
}

I'm a little bit worried about doing the following in Python, since I'm
not sure if the compiler is smart enough to avoid doing each regexp
search twice:

for line in inFile:
    if rexp1.search(line):
        something = rexp1.search(line).group(1)
    elif rexp2.search(line):
        somethingElse = rexp2.search(line).group(1)

In many cases I am worried about efficiency as these scripts parse a
couple GB of text!

Does anyone have a suggestion for cleaning up this commonplace Python
code construction?

Thanks,
Topher Cawlfield

Steven Bethard · Sep 10, 2004

Topher Cawlfield said:
> Can anyone suggest a more elegant solution?

Does this do what you want?

>>> rexp1 = re.compile(r'blah(dee)blah')
>>> rexp2 = re.compile(r'hum(dum)')
>>> for s in ['blahdeeblah', 'blah blah', 'humdum humdum']:
...     result = rexp1.findall(s) or rexp2.findall(s) or [None]
...     print repr(result[0])
...
'dee'
None
'dum'

The findall method returns all matches of the re in the string, or an empty
list if there were no matches. So if the first findall fails, the 'or'
expression will then evaluate the second findall, and if that one fails, the
default [None] is supplied, so result[0] is None. Note that findall returns a
list of the matches, which is why I have to extract the first element of the
list at the end.

> I'm a little bit worried about doing the following in Python, since I'm
> not sure if the compiler is smart enough to avoid doing each regexp
> search twice:
>
> for line in inFile:
>     if rexp1.search(line):
>         something = rexp1.search(line).group(1)
>     elif rexp2.search(line):
>         somethingElse = rexp2.search(line).group(1)

You're right here - Python will call the method twice (and therefore search
the string twice). It has no way of knowing that these two calls to the same
method will actually return the same results. (In general, there are no
guarantees that calling a method with the same parameters will return the same
result -- for example, file.read(100))
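The file.read(100) point can be seen with a small sketch. It uses Python 3 and an in-memory io.StringIO object standing in for a real file; the variable names are only illustrative.

```python
import io

# An in-memory text stream standing in for an open file.
f = io.StringIO("abcdef")

# Two calls with the same argument return different data, because
# each read advances the stream position.
first = f.read(3)
second = f.read(3)

print(first, second)
```

Since the interpreter cannot tell such a method apart from one that is safe to call once and reuse, it simply performs every call as written.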

Steve

Andrew Dalke · Sep 10, 2004

Topher said:
> But a few times already I've found myself
> writing the following bit of awkward code when parsing text files. Can
> anyone suggest a more elegant solution?
>
> rexp1 = re.compile(r'blah(dee)blah')
> rexp2 = re.compile(r'hum(dum)')
> for line in inFile:
>     reslt = rexp1.search(line)
>     if reslt:
>         something = reslt.group(1)
>     else:
>         reslt = rexp2.search(line)
>         if reslt:
>             somethingElse = reslt.group(1)

I usually solve this given case with a 'continue'

for line in inFile:
    reslt = rexp1.search(line)
    if reslt:
        something = reslt.group(1)
        continue
    reslt = rexp2.search(line)
    if reslt:
        somethingElse = reslt.group(1)
        continue

Still more cumbersome than the Perl equivalent.
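If the number of patterns keeps growing, one way to keep the "first match wins" flow of the 'continue' version flat is to loop over a table of patterns. The following is only a sketch in Python 3: the RULES list, the scan function and the results dict are names made up for illustration, and it assumes each pattern simply stores its first group under its own key.

```python
import re

# Hypothetical table: (name to store under, compiled pattern), in priority order.
RULES = [
    ('something', re.compile(r'blah(dee)blah')),
    ('somethingElse', re.compile(r'hum(dum)')),
]

def scan(lines):
    """Search each line once per pattern, stopping at the first pattern that matches."""
    results = {}
    for line in lines:
        for name, pattern in RULES:
            m = pattern.search(line)
            if m:
                results[name] = m.group(1)
                break  # plays the role of 'continue' in the flat version
    return results

print(scan(['blahdeeblah', 'blah blah', 'humdum humdum']))
```

Adding a fifth or sixth regexp then means adding a row to the table rather than another level of nesting.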

You could do a trick like this

import re

class Match:
    def __init__(self, pattern, flags=0):
        self.pat = re.compile(pattern, flags)
        self.m = None
    def __call__(self, s):
        self.m = self.pat.match(s)
        return bool(self.m)
    def __nonzero__(self):
        return bool(self.m)
    def group(self, x):
        return self.m.group(x)
    def start(self, x):
        return self.m.start(x)
    def end(self, x):
        return self.m.end(x)

pat1 = Match("A(.*)")
pat2 = Match("BA(.*)")
pat3 = Match("BB(.*)")

def test(s):
    if pat1(s): print "Looks like", pat1.group(1)
    elif pat2(s): print "no, it is", pat2.group(1)
    elif pat3(s): print "really?", pat3.group(1)
    else: print "Never mind."

This is much more along the lines of what you want
but it conflates the idea of search object and
match object and makes your code more susceptible
to subtle breaks. Consider

digits = Match(r"(\s*(\d+)\s*)")

def divisor(s):
    if s[:1] == "/":
        if digits(s[1:]):
            return int(digits.group(2))
        raise TypeError("nothing after the /")
    # no fraction, use 1 as the divisor
    return 1

def fraction(s):
    if digits(s):
        denom = divisor(s[digits.end(1):])
        return int(digits.group(2)), denom
    raise TypeError("does not start with a number")

But as a Perl programmer you are perhaps used to this,
because Perl does the same conflation and so has
the same problems. (I think. It's been a while ...
Nope! The regexp search results appear to be 'my'
variables now. When I started with perl4 all variables
were either global or "dynamically scoped"-ish with
'local'.)
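To make the breakage in the fraction example concrete, here is a sketch ported to Python 3 (so __bool__ instead of __nonzero__). The fraction_fixed function is not from the post; it is a hypothetical repair that reads the numerator before the shared match state can be overwritten.

```python
import re

class Match:
    """Python 3 port of the stateful wrapper: remembers only the latest match."""
    def __init__(self, pattern, flags=0):
        self.pat = re.compile(pattern, flags)
        self.m = None
    def __call__(self, s):
        self.m = self.pat.match(s)
        return bool(self.m)
    def __bool__(self):
        return bool(self.m)
    def group(self, x):
        return self.m.group(x)
    def end(self, x):
        return self.m.end(x)

digits = Match(r"(\s*(\d+)\s*)")

def divisor(s):
    if s[:1] == "/":
        if digits(s[1:]):              # overwrites digits.m
            return int(digits.group(2))
        raise TypeError("nothing after the /")
    return 1                           # no fraction, use 1 as the divisor

def fraction(s):
    if digits(s):
        denom = divisor(s[digits.end(1):])
        # digits.m may now hold the denominator's match, not the numerator's
        return int(digits.group(2)), denom
    raise TypeError("does not start with a number")

def fraction_fixed(s):
    # Hypothetical repair: copy what we need before calling divisor().
    if digits(s):
        numer = int(digits.group(2))
        rest = s[digits.end(1):]
        return numer, divisor(rest)
    raise TypeError("does not start with a number")

print(fraction("3/4"), fraction_fixed("3/4"))
```

With the input "3/4", fraction returns the denominator twice, because divisor reuses the same digits object and replaces the stored match before the numerator is read.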

> I'm a little bit worried about doing the following in Python, since I'm
> not sure if the compiler is smart enough to avoid doing each regexp
> search twice:
>
> for line in inFile:
>     if rexp1.search(line):
>         something = rexp1.search(line).group(1)
>     elif rexp2.search(line):
>         somethingElse = rexp2.search(line).group(1)
>
> In many cases I am worried about efficiency as these scripts parse a
> couple GB of text!

It isn't smart enough. To make it that smart would require
a lot more work. For example, how does it know that the
implementation of "rexp1.search(line)" always returns the
same value? Or even that "rexp1.search" returns the
same bound method?
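For readers on Python 3.8 or later, which postdates this thread, assignment expressions (the ':=' operator) let the if/elif chain keep its Perl-like shape while searching only once per pattern. This is a sketch, not something proposed in the thread; the scan function is a made-up wrapper so the result can be inspected.

```python
import re

rexp1 = re.compile(r'blah(dee)blah')
rexp2 = re.compile(r'hum(dum)')

def scan(lines):
    """Return the last captured values, searching each pattern at most once per line."""
    something = somethingElse = None
    for line in lines:
        # ':=' binds the match object and tests it in the same expression.
        if m := rexp1.search(line):
            something = m.group(1)
        elif m := rexp2.search(line):
            somethingElse = m.group(1)
    return something, somethingElse

print(scan(['blahdeeblah', 'blah blah', 'humdum humdum']))
```

The explicit binding is what removes the duplicate call; the interpreter still makes no attempt to merge two identical-looking calls on its own.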

Andrew

Jason Lai · Sep 10, 2004

Does it have to be stored in a different variable? If you have a list of
regexes and you want to see if any of them match, you could create a
compound regex such as "blah(dee)blah|hum(dum)" and search for that
(although you have to be careful about overlaps).
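A sketch of the compound-regex idea in Python 3, assuming named groups are acceptable so the code can tell which alternative matched. The group names reuse the variable names from the original post; the scan function and the found dict are illustrative only.

```python
import re

# One pattern with an alternative per original regexp; each alternative
# captures into its own named group.
combined = re.compile(r'blah(?P<something>dee)blah|hum(?P<somethingElse>dum)')

def scan(lines):
    """Search each line once and record whichever alternative matched."""
    found = {}
    for line in lines:
        m = combined.search(line)
        if m:
            # lastgroup is the name of the capturing group that took part
            # in the match; only one alternative can match at a time.
            found[m.lastgroup] = m.group(m.lastgroup)
    return found

print(scan(['blahdeeblah', 'blah blah', 'humdum humdum']))
```

One difference from the if/elif version: a single search reports whichever alternative matches earliest in the line, so a line containing both "humdum" and a later "blahdeeblah" is credited to the second pattern rather than the first.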

- Jason Lai
