Regex speed

  • Thread starter Reinhold Birkenfeld

Reinhold Birkenfeld

Hello,

I recently ported a simple utility script that analyzes a data file
from Perl to Python; it uses regex substitutions no more complex than

re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

When run without these regex substitutions, the scripts' speeds are
nearly equal. With the regexes applied, however, the Python version
takes five or more times as long as the Perl version.

So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?

--or--

Is there a Python interface for the PCRE library out there?

Thanks

Reinhold
 

A.M. Kuchling

re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('"') and s.endswith('"')'.
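A quick sketch of the equivalence amk describes (the helper name
strip_quotes is hypothetical, not from the thread): without re.M, the
anchors ^ and $ apply to the whole string, so the substitution reduces
to a startswith/endswith check plus a slice.

```python
import re

re3 = re.compile(r'^"(.*)"$')

def strip_quotes(s):
    # Without re.M, ^ and $ anchor to the whole string, so
    # re3.sub(r"\1", s) just strips one pair of surrounding quotes.
    if len(s) >= 2 and s.startswith('"') and s.endswith('"'):
        return s[1:-1]
    return s
```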
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?

The regex engine *is* implemented in C; look at Modules/_sre.c.
Is there a Python interface for the PCRE library out there?

PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.

--amk
 

Reinhold Birkenfeld

A.M. Kuchling said:
You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('"') and s.endswith('"')'.

You're right. The sub calls are these:

fro = re1.sub("", fro)
fro = re2.sub(r"\1", fro)
fro = re3.sub(r"\1", fro)

So at least the \s* are justified.

But my actual question is why Perl can run the same regexes in what
seems no time at all.
The regex engine *is* implemented in C; look at Modules/_sre.c.

But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?
PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.

You're right again. Is the pre module using the PCRE C library?

Reinhold
 

A.M. Kuchling

But my actual question is why Perl can run the same regexes in what
seems no time at all.

Probably Perl's engine does some particular bit of optimization that
Python's engine isn't doing. I have no idea what optimization that might
be.
But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?

That code is the compiler that turns expressions into the bytecode that the
regex engine runs. Having it be in Python only matters to the performance
of re.compile(), and that function usually isn't a bottleneck.
You're right again. Is the pre module using the PCRE C library?

An extensively hacked version of it. Modules/pypcre.c contains the bulk of
it; pcremodule.c is the Python interface to it.

--amk
 

Andrew Dalke

Reinhold said:
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

BTW, do you want those or
re1 = re.compile(r"\s*<[^>]*>\s*")
re2 = re.compile(r".*\(([^)]*)\).*")

(For the last it doesn't make much difference. There will only be
a single backtrack.)

For that matter, what about
re2 = re.compile(r"\(([^)]*)\)")
then using re2.search instead of re2.match?
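A sketch of that last suggestion (the wrapper function inparens is
hypothetical): drop the bracketing .* and use search() with the
captured group, instead of rewriting the whole string with sub().

```python
import re

# Original, used with sub() so the bracketing .* consumes everything
# around the parenthesized part:
re2_sub = re.compile(r".*\((.*)\).*")

# Suggested alternative: let search() find the parens, take group(1).
re2_search = re.compile(r"\(([^)]*)\)")

def inparens(text):
    m = re2_search.search(text)
    return m.group(1) if m else text

# For input with a single parenthesized group the two agree; with
# several groups, the greedy .* version picks the last one while
# search() picks the first -- a difference worth knowing about.
```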
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?

It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.

One thing about Python is that we tend to use regexps less
often than in Perl. For example, you may be able to use

def find_text_in_matching_pairs(text, start_c = "<", end_c = ">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.find(end_c, i)
    if j == -1:
        return None
    return text[i+1:j]

(If you instead want what your original regexp says, use

def find_text_in_matching_pairs(text, start_c = "<", end_c = ">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.rfind(end_c)
    if j < i:  # includes 'j == -1' on find failure
        return None
    return text[i+1:j]

)

def find1(text):
    return find_text_in_matching_pairs(text, "<", ">")

def find2(text):
    return find_text_in_matching_pairs(text, "(", ")")

def find3(text):
    if text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return None
Is there a Python interface for the PCRE library out there?

Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.

The old benchmarks compared pcre and sre and found that
sre was faster. See
http://groups.google.com/groups?oi=djq&selm=an_588925502

Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.

Andrew
(e-mail address removed)
 

Reinhold Birkenfeld

Andrew said:
Reinhold said:
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

BTW, do you want those or
re1 = re.compile(r"\s*<[^>]*>\s*")
re2 = re.compile(r".*\(([^)]*)\).*")

(For the last it doesn't make much difference. There will only be
a single backtrack.)

For that matter, what about
re2 = re.compile(r"\(([^)]*)\)")
then using re2.search instead of re2.match?
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?

It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.

One thing about Python is that we tend to use regexps less
often than in Perl. For example, you may be able to use

def find_text_in_matching_pairs(text, start_c = "<", end_c = ">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.find(end_c, i)
    if j == -1:
        return None
    return text[i+1:j]

OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter part is that the
Python version is now twelve LOC longer, which doesn't make the code
any easier to understand.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.
Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.

The old benchmarks compared pcre and sre and found that
sre was faster. See
http://groups.google.com/groups?oi=djq&selm=an_588925502

Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.

I used 2.3.4.

Reinhold
 

Peter Hansen

Reinhold said:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter part is that the
Python version is now twelve LOC longer, which doesn't make the code
any easier to understand.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.

Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)

-Peter
 

Reinhold Birkenfeld

Peter said:
Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)

Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.

Anyway, for your pleasure <wink>, the code (it takes a list of newsgroup
postings and calculates a followups-to-postings ratio for each poster):

------------------------------------------------------------------
At first the perl version (not coded by me):
------------------------------------------------------------------

#!/usr/bin/perl -w

use strict;

die unless @ARGV;

my (%post,%fup);
while (<STDIN>)
{
    chomp;
    my (undef,$from,$mid,$ref) = split /\|/;

    $from =~ s|\s*<.*>\s*||;
    $from = $1 if $from =~ m|\((.*)\)|;
    $from =~ s|^"(.*)"$|$1|;

    my $arr = $post{$from} ||= [];
    push @$arr,$mid;

    $fup{$ref}++;
}

my @arr;
for (keys %post)
{
    my $arr = $post{$_};

    my $m = 0;
    $m += $fup{$_} || 0 for @$arr;

    my $n = @$arr;
    push @arr,[$m/$n,$n,$m,$_];
}

@arr = sort { $b->[1] <=> $a->[1] } @arr;
$#arr = $ARGV[0]-1;

print "Pos Relevanz Posts F'ups Poster\n";
print "--- -------- ----- ----- ------------------\n";

my $i = 1;
for (sort { $b->[0] <=> $a->[0] || $b->[1] <=> $a->[1] } @arr)
{
    printf "%2d) %2.4f %4d %4d %s\n",$i++,@$_;
}

------------------------------------------------------------------
Original Python translation (with regexes):
------------------------------------------------------------------

#!/usr/bin/python

import sys, os, re

if len(sys.argv) == 1:
    sys.exit(0)

post = {}
fup = {}

re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

for line in sys.stdin:
    line = line.rstrip("\n")
    try:
        nix, fro, mid, ref = line.split("|")
    except: continue

    fro = re1.sub("", fro)
    fro = re2.sub(r"\1", fro)
    fro = re3.sub(r"\1", fro)

    post.setdefault(fro, []).append(mid)

    if ref in fup: fup[ref] += 1
    else: fup[ref] = 1

res = []

for name, arr in post.iteritems():
    m = 0
    for item in arr:
        if item in fup:
            m += fup[item]
    n = len(arr)
    res.append((float(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]

print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"

res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
    print "%2d)" % i, " %2.4f %4d %4d %s" % item
    i += 1

------------------------------------------------------------------
Optimized Python (without regexes):
------------------------------------------------------------------

#!/usr/bin/python

import sys, os

if len(sys.argv) == 1:
    sys.exit(0)

post = {}
fup = {}

def inparens(text):
    i = text.find("(")
    if i == -1: return text
    j = text.rfind(")")
    if j < i: return text
    return text[i+1:j]

def removeangles(text):
    i = text.find("<")
    if i == -1: return text
    j = text.rfind(">")
    if j < i: return text
    return text[0:i].rstrip() + text[j+1:].lstrip()

for line in sys.stdin:
    line = line.rstrip("\n")
    try:
        nix, fro, mid, ref = line.split("|")
    except: continue

    fro = inparens(removeangles(fro))
    if fro.startswith('"'):
        if fro.endswith('"'):
            fro = fro[1:-1]

    post.setdefault(fro, []).append(mid)

    if ref in fup: fup[ref] += 1
    else: fup[ref] = 1

res = []

for name, arr in post.iteritems():
    m = 0
    for item in arr:
        if item in fup:
            m += fup[item]
    n = len(arr)
    res.append((float(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]

print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"

res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
    print "%2d)" % i, " %2.4f %4d %4d %s" % item
    i += 1
 

Peter Hansen

Reinhold said:
Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.

I'm sorry, I managed to miss the significance of that
statement in your original post.

I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.

If that's true, you would at least get a tiny improvement
by aliasing re.sub to a local name before the loop, to
avoid the global lookup for "re" and then the attribute
lookup for "sub" on each of the three calls, each time
through the loop.
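Peter's aliasing idea, sketched (the function and names are
hypothetical stand-ins for the script's loop): bind the bound sub
method to a local once, so the per-iteration cost is just the call.

```python
import re

re1 = re.compile(r"\s*<.*>\s*")

def clean_all(lines):
    sub1 = re1.sub              # one attribute lookup, done once
    return [sub1("", line) for line in lines]
```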

If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.

-Peter
 

Reinhold Birkenfeld

Peter said:
I'm sorry, I managed to miss the significance of that
statement in your original post.
;)

I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.

If that's true, you would at least get a tiny improvement
by aliasing re.sub to a local name before the loop, to
avoid the global lookup for "re" and then the attribute
lookup for "sub" on each of the three calls, each time
through the loop.

If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.

Well, I did alias the sub methods in that way:

re1sub = re.compile("whatever").sub

There was a performance gain, but it was about 1/100th of the speed
difference.

Reinhold
 

Erik Heneryd

Reinhold said:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter part is that the
Python version is now twelve LOC longer, which doesn't make the code
any easier to understand.

If your problem is that the re.compile() setup takes too long (compared
to the time taken to do the actual matching/replacing), my guess is that
you don't operate on that much data. And if that's the case, does it
really matter if it's a fraction of a second slower?

Note that str.find() currently uses the naive search algorithm, which
has bad worst-case time complexity, so it's not suitable for larger
data sets. See
http://online.effbot.org/2004_08_01_archive.htm#find


Erik
 

Peter Hansen

Reinhold said:
re1sub = re.compile("whatever").sub

There was a performance gain, but it was about 1/100th of the speed
difference.

The above line just led me to realize that you are
doing everything at "module level" instead of inside
a function. That means all your variables are global,
instead of local, which means all variable lookups
are being done in a dictionary rather than taking
advantage of the significant optimization provided by
the use of locals (which use simple indexing operations).

Try sticking everything, including the above line,
inside a def main() and call that at the end, for
comparison.
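A sketch of the restructuring Peter suggests (the body is a stand-in,
not the actual script): move the module-level work into a main() so
every name becomes a fast local.

```python
import re

def main(lines):
    # Locals (re1sub, fro, out) are resolved by array index at runtime,
    # not by dictionary lookup as module-level globals are.
    re1sub = re.compile(r"\s*<.*>\s*").sub
    out = []
    for fro in lines:
        out.append(re1sub("", fro))
    return out

if __name__ == "__main__":
    main(["Reinhold <rb@example.invalid>"])
```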

(It's more likely at this point, however, that you
are quite right that the re.sub operations themselves
are taking most of the time, so this still won't
result in much overall improvement, I suspect.
Profiling would confirm that, of course.)

-Peter
 

Reinhold Birkenfeld

Erik said:
If your problem is that the re.compile() setup takes too long (compared
to the time taken to do the actual matching/replacing), my guess is that
you don't operate on that much data. And if that's the case, does it
really matter if it's a fraction of a second slower?

My data file is 6 MB in size; I don't think that counts as too little data.

Reinhold
 

Reinhold Birkenfeld

Peter said:
The above line just led me to realize that you are
doing everything at "module level" instead of inside
a function. That means all your variables are global,
instead of local, which means all variable lookups
are being done in a dictionary rather than taking
advantage of the significant optimization provided by
the use of locals (which use simple indexing operations).

Try sticking everything, including the above line,
inside a def main() and call that at the end, for
comparison.

Done. The results are the following:

Original:                        2.579 sec
sub method aliased to variable:  2.542 sec
everything stuffed in function:  2.422 sec

Without regex substitution:      0.523 sec
(but _with_ re.compile calls,
so this can't be the problem)

Perl version:                    0.811 sec
Perl version without regexes:    0.568 sec

Python version using find():     0.870 sec

(all values are averages)

So you're right, speed is getting better.

Reinhold
 

Erik Heneryd

Reinhold said:
Original:                        2.579 sec
sub method aliased to variable:  2.542 sec
everything stuffed in function:  2.422 sec

Without regex substitution:      0.523 sec
(but _with_ re.compile calls,
so this can't be the problem)

Ah, thought you said something about compile() being the timesink, but I
guess I misunderstood...


Erik
 
