Regular Expressions - Python vs Perl

codecraig

Hi,
I am interested in regular expressions and how Perl and Python
compare. Particularly, I am interested in performance (i.e. speed),
memory usage, flexibility, completeness (i.e. supports simple and
complex regex operations...basically is RegEx a strong module/library
in Python?)

Anyone have any information on this? Any numbers, benchmarks?

Thanks so much. I know this is a python user group...but try to be as
unbiased as you can.
 
codecraig

Well so far from what I have found, Perl is faster than Python for
RegEx, although perl is harder to read.
 
Fredrik Lundh

codecraig said:
Well so far from what I have found, Perl is faster than Python for
RegEx, although perl is harder to read.

is this based on actual benchmarks, or just what people are saying on
the intarweb?

</F>
 
Paul McGuire

I'd be very interested to see if there actually is a benchmark suite
for regexp's. I imagine that this could be an easy area for quite a
varied set of results, depending on the expression features included in
the actual regexp being tested, and even the nature of the input text.
For example, a simple re that just scans for words in a text stream may
perform very differently from one that searches for delimited text, has
to lookahead for greedy matches, maintains return groups, performs
named substitutions, etc.

Without a pretty thorough benchmark suite, I would be dubious of
performance claims being very much better than anecdotes.

-- Paul
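
To make the point about pattern features concrete, here is a minimal micro-benchmark sketch using the standard timeit module. The sample text and the four patterns are made up for illustration and are not taken from any benchmark suite mentioned in this thread; the absolute numbers will vary by machine and Python version, so treat them as anecdotes too.

```python
import re
import timeit

# Hypothetical input text; a real benchmark would use representative data.
text = 'the "quick brown" fox jumps over the lazy dog ' * 200

# Hypothetical patterns, chosen only to exercise different engine features.
patterns = {
    "plain words": r"\w+",
    "delimited text": r'"[^"]*"',
    "lookahead": r"\w+(?=\s+dog)",
    "named group": r"(?P<first>\w)\w*",
}

def bench(pattern, text, repeat=20):
    """Return the total seconds taken by `repeat` findall() passes."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.findall(text), number=repeat)

results = {name: bench(p, text) for name, p in patterns.items()}
for name, seconds in results.items():
    print("%-15s %.4f s" % (name, seconds))
```

Running the same patterns over a different input text is a quick way to see how much the results depend on the data as well as on the expression.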
 
Thomas Bartkus

codecraig said:
Well so far from what I have found, Perl is faster than Python for
RegEx, although perl is harder to read.

Yawn

How about Python being easier to *write*?

It never ceases to amaze me. It takes days, weeks, months, sometimes even
years to write significantly useful software. And yet so many seem to think
it is worthy to bother over the seconds that might be saved at execution
time.

If one were to achieve a mere percentage point or two in improving the
"write" efficiency of software - think how much more the world gains in
software quality and quantity. How about man hours saved? Why does anyone
still waste so much angst over execution speed?

I doubt the total execution time for all the RegEx queries you ever ran took
as much time as you just wasted on your little experiment.
Thomas Bartkus
 
Fredrik Lundh

Paul said:
I'd be very interested to see if there actually is a benchmark suite
for regexp's. I imagine that this could be an easy area for quite a
varied set of results, depending on the expression features included in
the actual regexp being tested, and even the nature of the input text.
For example, a simple re that just scans for words in a text stream may
perform very differently from one that searches for delimited text, has
to lookahead for greedy matches, maintains return groups, performs
named substitutions, etc.

Without a pretty thorough benchmark suite, I would be dubious of
performance claims being very much better than anecdotes.

we did fairly extensive benchmarks when moving from PRE to SRE back in
the 1.6 days (partially based on benchmarks developed when moving from
REGEX to PRE):

http://mail.python.org/pipermail/python-dev/2000-August/007797.html

(but as you can see, this benchmarking was done to make sure that the new
engine didn't slow things down, not to see if SRE was slower or faster than
the "competition". feel free to try the microbenchmarks with recent versions
of Python and your favourite non-Python language...)

</F>
 
djw

Thomas said:
Yawn

How about Python being easier to *write*?

It never ceases to amaze me. It takes days, weeks, months, sometimes even
years to write significantly useful software. And yet so many seem to think
it is worthy to bother over the seconds that might be saved at execution
time.

If one were to achieve a mere percentage point or two in improving the
"write" efficiency of software - think how much more the world gains in
software quality and quantity. How about man hours saved? Why does anyone
still waste so much angst over execution speed?

I doubt the total execution time for all the RegEx queries you ever ran took
as much time as you just wasted on your little experiment.
Thomas Bartkus

While I agree with (most of) your points, one should not overlook the
fact that there are cases when performance does matter (huge datasets
maybe?). Since the OP didn't indicate why performance was important to
him/her, one cannot assume that it's not a valid concern.

-Don
 
codecraig

I found some benchmarking (perhaps simple) but search for "The Great
Computer language shootout" ....look at the original shootout and the
win32 one.

Thomas:
"I doubt the total execution time for all the RegEx queries you ever
ran took
as much time as you just wasted on your little experiment. " .....no
need to be angry. I don't have some "little experiment", but thanks
for being concerned about me wasting my time.

I do understand that python is certainly easier to read, no doubt. I
was just doing some research to find out about speed/performance
between the two.

But thanks for pointing out, again, that Python is easier to read.
(let's just not forget that python is great for other things other than
just readability.)
 
Thomas Bartkus

codecraig said:
I found some benchmarking (perhaps simple) but search for "The Great
Computer language shootout" ....look at the original shootout and the
win32 one.

Thomas:
"I doubt the total execution time for all the RegEx queries you ever
ran took
as much time as you just wasted on your little experiment. " .....no
need to be angry. I don't have some "little experiment", but thanks
for being concerned about me wasting my time.

I do understand that python is certainly easier to read, no doubt. I
was just doing some research to find out about speed/performance
between the two.

But thanks for pointing out, again, that Python is easier to read.
(let's just not forget that python is great for other things other than
just readability.)

You're quite welcome.
BUT I didn't say a word about readability.

I claimed it was easier to *write*.
Thomas Bartkus
 
Thomas Bartkus

djw said:
While I agree with (most of) your points, one should not overlook the
fact that there are cases when performance does matter (huge datasets
maybe?). Since the OP didn't indicate why performance was important to
him/her, one cannot assume that it's not a valid concern.

Yes, yes, but then - the converse would be true. One cannot assume it *is*
a valid concern.

I could have gone further and pointed out that the RegEx module (now re !)
is probably just C code hooked to Python syntax. IOW - the execution speed
of his RegEx module has *nothing at all* to do with the Python language,
only the efficiency of the particular code library he was using.

All in all, execution speed for any one particular task is a sucky way to
evaluate a general purpose programming language. If gonzo RegEx query
performance was of utmost importance, would anyone put either of Perl or
Python at the top of his list?

Thomas Bartkus
 
James Stroud

Is it relevant that Python can produce compiled expressions? I don't think
that there is such a thing with Perl.

Also, to all of the dozen or so people in the world less wise than me about
programming: don't choose your language on how fast the regex engine is. This
would then become a case of premature optimization.

James

Thomas Bartkus said:
Yes, yes, but then - the converse would be true. One cannot assume it *is*
a valid concern.

I could have gone further and pointed out that the RegEx module (now re !)
is probably just C code hooked to Python syntax. IOW - the execution speed
of his RegEx module has *nothing at all* to do with the Python language,
only the efficiency of the particular code library he was using.

All in all, execution speed for any one particular task is a sucky way to
evaluate a general purpose programming language. If gonzo RegEx query
performance was of utmost importance, would anyone put either of Perl or
Python at the top of his list?

Thomas Bartkus

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
Terry Reedy

codecraig said:
I am interested in regular expressions and how Perl and Python
compare. Particularly, I am interested in performance (i.e. speed),
memory usage, flexibility, completeness (i.e. supports simple and
complex regex operations...basically is RegEx a strong module/library
in Python?)

Depending upon your particular application, 'completeness' may be a more
relevant concern than 'performance'. I believe the original Python regex
engine did not have all the Perl extensions, some of them decidedly 'non
regular'. It was replaced by the 'perl-compatible regex engine' (pcre or
pre), written in C by a non-pythonista so that other
languages/applications, like Python, could drop it in and have what the
title claimed -- perl-like re capability. The current sre engine was
locally written to include unicode, with the re syntax unchanged (? or
nearly so) from the pre. So I would say that the answer to your last
question is 'yes'.

Terry J. Reedy
 
Karl A. Krueger

Terry Reedy said:
Depending upon your particular application, 'completeness' may be a
more relevant concern than 'performance'. I believe the original
Python regex engine did not have all the Perl extensions, some of them
decidedly 'non regular'. It was replaced by the 'perl-compatible regex
engine' (pcre or pre), written in C by a non-pythonista so that other
languages/applications, like Python, could drop it in and have what
the title claimed -- perl-like re capability.

By way of comparison, there do exist at least some Perl-compatible regex
libraries in other non-Perl languages, which don't use libpcre.

An example is CL-PPCRE (http://www.weitz.de/cl-ppcre/), which claims to
be "more compatible with the regex semantics of Perl 5.8.0 than, say,
Perl 5.6.1 is."
 
codecraig

Thanks for the input. I was just looking for some feedback about which
was better and faster, if an answer exists. However, I am not choosing
Perl or Python b/c of its RegEx engine as someone mentioned. The
question was just because I was curious, sorry if I misled you to think
I was choosing which language to program with based on the RegEx
performance. Also, I was not choosing based on performance...I just
wanted to know how they compared.
 
Ilpo Nyyssönen

James Stroud said:
Is it relevant that Python can produce compiled expressions? I don't think
that there is such a thing with Perl.

The problem in python here is that it needs to always recompile the
regexp. I would like to have a way to write a regexp as a constant and
then python should compile that regexp to the byte-code file.

This is a problem when one has a big amount of regexps. One example is
the xmlproc parser in PyXML,

This is not a problem in a program that continues to run long times,
but I want short lived programs like command line apps.

Of course we do have ways to go around that limitation, but that is
just ugly.
 
Ville Vainio

"Ilpo" == Ilpo Nyyssönen <iny> writes:


Ilpo> The problem in python here is that it needs to always
Ilpo> recompile the regexp. I would like to have a way to write a
Ilpo> regexp as a constant and then python should compile that
Ilpo> regexp to the byte-code file.

Ilpo> This is a problem when one has a big amount of regexps. One
Ilpo> example is the xmlproc parser in PyXML,

Read the source for sre.py, esp. _compile. The compiled regexps are
cached, so when you invoke e.g. re.match(), it doesn't recompile the
regexp.
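
The caching described here can be observed from the outside. The sketch below assumes current CPython, where re.compile() hands back the cached pattern object for a repeated pattern string; that identity is an implementation detail rather than a documented guarantee, and the pattern is borrowed from elsewhere in this thread purely as sample data. re.purge() is the documented way to clear the cache.

```python
import re

# Sample pattern only; any pattern string would do.
pattern = r"a.*ardvark[0-9]{34}"

first = re.compile(pattern)
second = re.compile(pattern)    # same string and flags: served from the cache
same_object = first is second   # CPython implementation detail, not a promise

re.purge()                      # documented call that clears re's cache
third = re.compile(pattern)     # has to be compiled again after the purge
fresh_object = third is not first

print("cache hit returned the same object:", same_object)
print("purge forced a new object:", fresh_object)
```

The cache lives in the running process, so it says nothing about the cost paid at each fresh start of a program.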

So this point is moot, and perl's approach is excessive special
casing.
 
Ilpo Nyyssönen

Ville Vainio said:
Ilpo> The problem in python here is that it needs to always
Ilpo> recompile the regexp. I would like to have a way to write a
Ilpo> regexp as a constant and then python should compile that
Ilpo> regexp to the byte-code file.

Ilpo> This is a problem when one has a big amount of regexps. One
Ilpo> example is the xmlproc parser in PyXML,

Read the source for sre.py, esp. _compile. The compiled regexps are
cached, so when you invoke e.g. re.match(), it doesn't recompile the
regexp.

If you had read what I said, you would have noticed this:

,----
| I would like to have a way to write a regexp as a constant and then
| python should compile that regexp to the byte-code file.
`----

and this:

,----
| This is not a problem in a program that continues to run long times,
| but I want short lived programs like command line apps.
`----

Of course it caches those when running. The point is that it needs to
recompile every time you have restarted the program. With short lived
command line programs this really can be a problem.

And yes, I have read the source of sre.py and I have made an ugly
module that digs the compiled data and pickles it to a file and then
in next startup it reads that file and puts the stuff back to the
cache.
 
Roy Smith

Ilpo Nyyssönen said:
Of course it caches those when running. The point is that it needs to
recompile every time you have restarted the program. With short lived
command line programs this really can be a problem.

Are you speculating that it might be a problem, or saying that you have
seen it be a problem in a real-life program?

I just generated a bunch of moderately simple regexes from a dictionary
wordlist. Looks something like:

Roy-Smiths-Computer:play$ head exps
a.*a[0-9]{34}
a.*ah[0-9]{34}
a.*ahed[0-9]{34}
a.*ahing[0-9]{34}
a.*ahs[0-9]{34}
a.*al[0-9]{34}
a.*alii[0-9]{34}
a.*aliis[0-9]{34}
a.*als[0-9]{34}
a.*ardvark[0-9]{34}

Then I ran them through a little script that does:

for exp in sys.stdin.readlines():
    regex = re.compile(exp)

and timed it for various numbers of lines. On my G4 Powerbook (1 GHz
PowerPC), I'm compiling about 1000 regex's per second:

Roy-Smiths-Computer:play$ time head -5000 < exps | ./regex.py

real 0m5.208s
user 0m4.690s
sys 0m0.090s

So, my guess is that unless you're compiling 100's of regexes each time you
start up, the one-time compilation costs are probably not significant.
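
A self-contained way to repeat this kind of measurement, without the shell pipeline, might look like the sketch below. The short word list is a hypothetical stand-in for the dictionary file used above, and compile_all is a name invented here; re.purge() is called first so that the internal cache does not hide the compilation cost.

```python
import re
import time

# Hypothetical stand-in for the dictionary word list used in the post above.
words = ["", "h", "hed", "hing", "hs", "l", "lii", "liis", "ls", "rdvark"]
patterns = ["a.*a%s[0-9]{34}" % w for w in words]

def compile_all(patterns):
    """Compile every pattern from a cold cache; return (compiled, seconds)."""
    re.purge()  # empty re's cache so each pattern is really compiled
    start = time.perf_counter()
    compiled = [re.compile(p) for p in patterns]
    elapsed = time.perf_counter() - start
    return compiled, elapsed

compiled, elapsed = compile_all(patterns)
print("compiled %d patterns in %.6f s" % (len(compiled), elapsed))
```

With a real word list in place of the ten samples, the printed figure gives a per-machine estimate of the start-up cost being discussed.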

Ilpo Nyyssönen said:
And yes, I have read the source of sre.py and I have made an ugly
module that digs the compiled data and pickles it to a file and then
in next startup it reads that file and puts the stuff back to the
cache.

That's exactly what I would have done if I really needed to improve startup
speed. In fact, I did something like that many moons ago, in a previous
life. See R. Smith, "A finite state machine algorithm for finding
restriction sites and other pattern matching applications", CABIOS, Vol 4,
no. 4, 1988. In that case, I had about 1200 patterns I was searching for
(and doing it on hardware running about 1% of the speed of my current
laptop).

BTW, why did you have to dig out the compiled data before pickling it?
Could you not have just pickled whatever re.compile() returned?
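
On the pickling question, a quick experiment is possible. As far as I know, current CPython pickles a compiled pattern by storing its source string and flags and compiling again on load, so the round trip works but may not save the compilation step. The sketch below only checks what can be observed from outside; the pattern is sample data from this thread.

```python
import pickle
import re

original = re.compile(r"a.*ardvark[0-9]{34}")

blob = pickle.dumps(original)
restored = pickle.loads(blob)

# If the pattern source travels inside the pickle, loading it most likely
# means compiling it again rather than restoring precompiled data.
source_in_blob = original.pattern.encode("ascii") in blob

print("round trip kept the pattern:", restored.pattern == original.pattern)
print("pattern source found in the pickle:", source_in_blob)
```

That would be consistent with having to dig out the compiled data by hand in order to skip recompilation.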
 
Ville Vainio

"Ilpo" == Ilpo Nyyssönen <iny> writes:

Ilpo> Of course it caches those when running. The point is that it
Ilpo> needs to recompile every time you have restarted the
Ilpo> program. With short lived command line programs this really
Ilpo> can be a problem.

I didn't imagine it could be longer than 1 second overhead - and if
you have so many regexps, it must do something so nontrivial that 1
second doesn't matter. Perhaps I have a different mindset about this
:).

Ilpo> And yes, I have read the source of sre.py and I have made an
Ilpo> ugly module that digs the compiled data and pickles it to a
Ilpo> file and then in next startup it reads that file and puts
Ilpo> the stuff back to the cache.

What's so ugly about it? The fact that you need to rewrite the cache
when you change some of the regexps? I can't imagine you change more
than, say, 10 of the regexps a day (compiling of which is an
insignificant performance hit) and when you "ship" the script, you
will freeze the regexps anyway.
 
Terry Hancock

codecraig said:
I am interested in regular expressions and how Perl and Python
compare. Particularly, I am interested in performance (i.e. speed),
memory usage, flexibility, completeness (i.e. supports simple and
complex regex operations...basically is RegEx a strong module/library
in Python?)

Understand that I have used regexes very very little in Perl (I took a
class, that's about it). However, I have translated a couple of Perl modules
into Python.

I find that Perl programmers use the rather opaque "regex style" much
too often, so that I usually replace several regexes with simple string
searches, e.g.

original program: uses regex to match /.*foo.*/
python translation: just use s.find('foo')

That's not really for performance reasons (though it probably is faster?),
but because it just makes it clearer what you're trying to do.
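
As a small runnable illustration of that translation, the two helpers below (names invented for this sketch) give the same answer on single-line strings:

```python
import re

def has_foo_regex(s):
    """Regex style: does the string match .*foo.* from the start?"""
    return re.match(r".*foo.*", s) is not None

def has_foo_plain(s):
    """Plain string search; `"foo" in s` would say the same thing."""
    return s.find("foo") != -1

samples = ["a foo b", "foo", "bar", ""]
regex_results = [has_foo_regex(s) for s in samples]
plain_results = [has_foo_plain(s) for s in samples]
print(regex_results == plain_results)
```

One difference to keep in mind: without re.DOTALL the dot does not match a newline, so the two can disagree on multi-line input such as "x\nfoo".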

OTOH, some of the regexes will be "real" regexes, in which case
Python's way of expressing regexes as strings makes things a whole
lot clearer, e.g.:

junk = r'.*'
word = r'\b\w+\b'
domain = r'(%s\.)*%s' % (word,word)
re_mail = re.compile(junk + word + '@' + domain + junk )

Although, of course, you can just write:

re_mail = re.compile(r'.*\b\w+\b@(\b\w+\b\.)*\b\w+\b.*')

Which is shorter, but frankly, I had a hard time just keeping it straight to
type it here --- I think the first version was actually faster to write, even if
it takes up more space. Also, I screwed up on the first time when I wrote
the regex for a word (forgot the '+'), so having it factored out like this made
it faster to fix the mistake.
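
For anyone who wants to check the factored version, the sketch below rebuilds it and compares the result against the fully expanded pattern string. The sample sentences are made up, and this is only a toy pattern, not a real e-mail validator.

```python
import re

# The building blocks from the post above.
junk = r'.*'
word = r'\b\w+\b'
domain = r'(%s\.)*%s' % (word, word)
re_mail = re.compile(junk + word + '@' + domain + junk)

# What the pieces expand to when written out as a single string.
expanded = r'.*\b\w+\b@(\b\w+\b\.)*\b\w+\b.*'
print(re_mail.pattern == expanded)

# Made-up sample inputs.
print(re_mail.match("mail me at terry@example.org please") is not None)
print(re_mail.match("no address here") is not None)
```

Comparing the .pattern attribute against a hand-expanded string is a cheap way to catch a slip in either form.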

Which is still a dumb example, but you can see what I mean about making
the code easier to read / refactor. AFAIK, Perl does not make this particularly
easy.

Python regexes probably allow almost, but not quite all, of what Perl
regexes do (I think the current Python regex language is pretty much
identical with the one in Perl 5, but some newer features are in the most
cutting-edge Perl release, IIRC).

For all but the simplest jobs, Python regexes should be compiled, as I
do above. In fact, I just never bother with using them directly --- I think
the regex will get compiled when used, even if you don't do it explicitly,
and the explicit compiled regex can be stored for multiple uses, etc.

Although it wouldn't surprise me to learn that Perl's regex engine is
slightly more optimized (seeing as it is used so much), I wouldn't want
to bet on it. I doubt you'll notice any difference even if one exists, and
the speedup from eliminating regexes where they don't belong would
probably wipe it out anyway.

codecraig said:
Anyone have any information on this? Any numbers, benchmarks?

No benchmarks, sorry. I don't care enough about the speed. But I do
feel that Python regexes are both clearer and more flexible. They encourage
code re-use and self-documentation. And they can pretty much do whatever
their Perl equivalents can.

OTOH, Python programmers do not love them the way Perl programmers
do. So they are used less. This is not least because Python has a lot of
very powerful higher-level string manipulation tools.

codecraig said:
Thanks so much. I know this is a python user group...but try to be as
unbiased as you can.

Can't claim to be unbiased, sorry. ;-)

Cheers,
Terry
 
