Regular Expressions - Python vs Perl

Ilpo Nyyssönen

Roy Smith said:
Are you speculating that it might be a problem, or saying that you have
seen it be a problem in a real-life program?

Well, it depends, I might say yes. I have a calendar app with command
line user interface. There the use is like this: "view, add, view,
edit, view, ..." and those are separate command invocations. In that
case a second in startup speed can be a long time. And I did use the
profiler and it did show the sre compiling to be the slowest thing.

Nowadays I use libxml2-python as the XML parser and so the problem is
not so acute anymore. (That is just harder to get running for
python compiled from source outside the rpm system and it is not so
easy to use via DOM interface.)

Roy Smith said:
I just generated a bunch of moderately simple regexes from a dictionary
wordlist. Looks something like:
[...]

So, my guess is that unless you're compiling 100's of regexes each time you
start up, the one-time compilation costs are probably not significant.
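
A rough way to check this guess on your own machine is to time a cold compile of a batch of patterns. The sketch below is not the code elided from the post above; the word list and the pattern shape are made-up stand-ins, and `time_compilation` is a hypothetical helper name.

```python
import re
import time


def time_compilation(patterns):
    """Return (compiled_patterns, elapsed_seconds) for a cold compile."""
    re.purge()  # empty the re module's cache so every pattern is really compiled
    start = time.perf_counter()
    compiled = [re.compile(p) for p in patterns]
    return compiled, time.perf_counter() - start


# Hypothetical stand-in for a dictionary wordlist.
words = ["view", "add", "edit", "calendar", "python", "perl"]
patterns = [rf"\b{re.escape(w)}(s|ed|ing)?\b" for w in words]

compiled, elapsed = time_compilation(patterns)
print(f"compiled {len(compiled)} patterns in {elapsed * 1000:.2f} ms")
```

The absolute number depends heavily on the interpreter version and on how complex the patterns are, so it only makes sense to compare runs made on the same setup.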

Well, as I said, I did get it to be the worst in profiler when using
PyXML/xmlproc.

Roy Smith said:
That's exactly what I would have done if I really needed to improve startup
speed. In fact, I did something like that many moons ago, in a previous
life. See R. Smith, "A finite state machine algorithm for finding
restriction sites and other pattern matching applications", CABIOS, Vol 4,
no. 4, 1988. In that case, I had about 1200 patterns I was searching for
(and doing it on hardware running about 1% of the speed of my current
laptop).

The problem is that it is not so easy to get ALL of the regexps dumped
in that way.

Roy Smith said:
BTW, why did you have to dig out the compiled data before pickling it?
Could you not have just pickled whatever re.compile() returned?

Because it dumps the original regexp and then compiles it when loading.
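
The behaviour described here can be observed with a short experiment. This is a minimal sketch, assuming a current CPython; the date pattern is only an illustration.

```python
import pickle
import re

pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
payload = pickle.dumps(pattern)

# The pickle carries the pattern's source text (plus its flags),
# not the compiled program produced by re.compile().
source_is_stored = pattern.pattern.encode("ascii") in payload
print("source text found in pickle:", source_is_stored)

# Loading therefore goes through the regex compiler again.
restored = pickle.loads(payload)
print(restored.pattern == pattern.pattern)
print(restored.match("2005-03-14").groups())
```

So pickling the object returned by `re.compile()` saves typing the pattern again, but it does not save the compilation work at load time.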
 
Fredrik Lundh

Ilpo said:
Well, it depends, I might say yes. I have a calendar app with command
line user interface. There the use is like this: "view, add, view,
edit, view, ..." and those are separate command invocations. In that
case a second in startup speed can be a long time. And I did use the
profiler and it did show the sre compiling to be the slowest thing.

so you picked the wrong file format for the task, and the slowest tool you
could find for that file format, and instead of fixing that, you decided that the
regular expression engine was to blame for the bad performance. hmm.

Ilpo said:
Nowadays I use libxml2-python as the XML parser and so the problem is
not so acute anymore. (That is just harder to get running for
python compiled from source outside the rpm system and it is not so
easy to use via DOM interface.)

python has shipped with a fast XML parser since 2.1, or so.

</F>
 
Ilpo Nyyssönen

Fredrik Lundh said:
so you picked the wrong file format for the task, and the slowest
tool you could find for that file format, and instead of fixing
that, you decided that the regular expression engine was to blame
for the bad performance. hmm.

What would you recommend instead?

I have searched alternatives, but somehow I still find XML the best
there is. It is a standard format with standard programming API.

I don't want to lose my calendar data. XML as a standard format makes
it easier to convert later to some other format. As a textual format
it is also readable as raw text and this eases debugging.

And my point is that the regular expression compilation can be a
problem in python. The current regular expression engine is just
unusably slow in short-lived programs with a bit bigger amount of
regexps. And fixing it should not be that hard: an easy improvement
would be to add some kind of storing mechanism for the compiled
regexps. Are there any reasons not to do this?

Fredrik Lundh said:
python has shipped with a fast XML parser since 2.1, or so.

With what features? validation? I really want a validating parser with
a DOM interface. (Or something better than DOM, must be object
oriented.)

I don't want to make my programs ugly (read: use some more low level
interface) and error prone (read: no validation) to make them fast.
 
Ville Vainio

Ilpo> What would you recommend instead?

Ilpo> I have searched alternatives, but somehow I still find XML
Ilpo> the best there is. It is a standard format with standard
Ilpo> programming API.

Ilpo> I don't want to lose my calendar data. XML as a standard
Ilpo> format makes it easier to convert later to some other
Ilpo> format. As a textual format it is also readable as raw
Ilpo> text and this eases debugging.

Use pickle, perhaps, for optimal speed and code non-ugliness. You can
always use xml as import/export format, perhaps even dumping the db to
xml at the end of each day.

Ilpo> And my point is that the regular expression compilation can
Ilpo> be a problem in python. The current regular expression
Ilpo> engine is just unusably slow in short-lived programs with a
Ilpo> bit bigger amount of regexps. And fixing it should not be
Ilpo> that hard: an easy improvement would be to add some kind of
Ilpo> storing mechanism for the compiled regexps. Are there any
Ilpo> reasons not to do this?

It should start life as a third-party module (perhaps written by you,
who knows :). If it is deemed useful and clean enough, it could be
integrated w/ python proper. This is clearly something that should not
be in the python core, because the regexps themselves aren't there
either.

Ilpo> With what features? validation? I really want a validating
Ilpo> parser with a DOM interface. (Or something better than DOM,
Ilpo> must be object oriented.)

Check out (coincidentally) Fredrik's elementtree:

http://effbot.org/zone/element-index.htm

Ilpo> I don't want to make my programs ugly (read: use some more
Ilpo> low level interface) and error prone (read: no validation)
Ilpo> to make them fast.

Why don't you use external validation on the created xml? Validating
it every time sounds way too much like Javaic B&D to be fun
anymore. Pickle should serve you well, and would probably remove about
half of your code. "Do the simplest thing that could possibly work"
and all that.
 
Ilpo Nyyssönen

[reorganized a bit]

Ville Vainio said:
Why don't you use external validation on the created xml? Validating
it every time sounds way too much like Javaic B&D to be fun
anymore. Pickle should serve you well, and would probably remove about
half of your code. "Do the simplest thing that could possibly work"
and all that.

What is the point in doing validation if it isn't done every time? Why
wouldn't I do it every time? It isn't that slow a thing to do.

Pickle doesn't have validation. I am not comfortable for using it as
storage format that should be reliable over years when the program
evolves. It also doesn't tell me if my program has put something other
to the data than I meant to. The program will just throw some weird
exception.

I want to do the simplest thing, but I also want something that helps
me keep the program usable also in the future. I prefer putting some
resources to get some validation to it initially than use later more
resources to do something with undetermined lump of data.

Ville Vainio said:
Ilpo> With what features? validation? I really want a validating
Ilpo> parser with a DOM interface. (Or something better than DOM,
Ilpo> must be object oriented.)

Check out (coincidentally) Fredrik's elementtree:

http://effbot.org/zone/element-index.htm

At least the interface looks quite simple and usable. With some
validation wrapping over it, it might be ok...

Ville Vainio said:
Ilpo> And my point is that the regular expression compilation can
Ilpo> be a problem in python. The current regular expression
Ilpo> engine is just unusably slow in short-lived programs with a
Ilpo> bit bigger amount of regexps. And fixing it should not be
Ilpo> that hard: an easy improvement would be to add some kind of
Ilpo> storing mechanism for the compiled regexps. Are there any
Ilpo> reasons not to do this?

It should start life as a third-party module (perhaps written by you,
who knows :). If it is deemed useful and clean enough, it could be
integrated w/ python proper. This is clearly something that should not
be in the python core, because the regexps themselves aren't there
either.

How can it work automatically in separate module? Replacing the
re.compile with something sounds possible way of getting the regexps,
but how and where to store the compiled data? Is there a way to put it
to the byte code file?

Maybe I need to take a look at it when I find the time...
 
Fredrik Lundh

Ilpo said:
What is the point in doing validation if it isn't done every time? Why
wouldn't I do it every time? It isn't that slow a thing to do.

DTD validation is useful in two cases: making sure that data from
a foreign source has the right structure, and making sure that data
you create has the right structure. The former is relevant for
deployed code, but the latter really only makes sense during
development, and can easily be solved by running an external validator
as part of your test suite.

Ilpo said:
Pickle doesn't have validation. I am not comfortable for using it as
storage format that should be reliable over years when the program
evolves. It also doesn't tell me if my program has put something other
to the data than I meant to.

But DTD validation doesn't tell you that either -- it's only concerned
with the structure, not the content. You can get a bit further with better
schema technologies, but if you want reliable storage, use checksums
or digests. Validation is like the helmet used by skydivers; if you think
that's all you need, you sure are going to be surprised when you hit the
ground.
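
The checksum/digest suggestion could be sketched as follows. This is only an illustration under assumptions: the sidecar file name (`.sha256` appended to the data file name) and the helper names `save_with_digest` and `load_verified` are hypothetical, and SHA-256 is just one reasonable digest choice.

```python
import hashlib
from pathlib import Path


def save_with_digest(path, data: bytes) -> None:
    """Write the data file plus a sidecar file holding its SHA-256 digest."""
    path = Path(path)
    path.write_bytes(data)
    Path(str(path) + ".sha256").write_text(hashlib.sha256(data).hexdigest())


def load_verified(path) -> bytes:
    """Read the data file and refuse it if it no longer matches the stored digest."""
    path = Path(path)
    data = path.read_bytes()
    expected = Path(str(path) + ".sha256").read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"{path}: content does not match its stored SHA-256 digest")
    return data
```

A digest detects accidental corruption or truncation of the file; it says nothing about whether the content was sensible when it was written.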

Ilpo said:
I want to do the simplest thing, but I also want something that helps
me keep the program usable also in the future. I prefer putting some
resources to get some validation to it initially than use later more
resources to do something with undetermined lump of data.

If you want the simplest thing, get rid of the DTD, and make your
loader ignore things that it doesn't recognize, use default values for
fields that are not required (or weren't in the format from the start),
and give a nice readable error message if something required is
missing. That'll give you a nice, portable, reliable, and extremely
future-proof design.
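
A loader along these lines might look like the sketch below, using the `xml.etree.ElementTree` module from the standard library. The element names (`event`, `date`, `title`, `location`, `priority`), the default values and the `CalendarFormatError` class are all assumptions made up for the example, not the format of the actual calendar application.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: required fields and optional fields with defaults.
REQUIRED = ("date", "title")
DEFAULTS = {"location": "", "priority": "normal"}


class CalendarFormatError(ValueError):
    """Raised with a readable message when a required field is missing."""


def load_events(xml_text):
    root = ET.fromstring(xml_text)
    events = []
    for index, node in enumerate(root.iter("event"), start=1):
        event = dict(DEFAULTS)  # start from the defaults
        # Only known fields are looked up, so unknown elements are ignored.
        for field in REQUIRED + tuple(DEFAULTS):
            child = node.find(field)
            if child is not None and child.text:
                event[field] = child.text.strip()
        missing = [f for f in REQUIRED if f not in event]
        if missing:
            raise CalendarFormatError(
                f"event #{index}: missing required field(s): {', '.join(missing)}"
            )
        events.append(event)
    return events
```

Note the trade-off: a misspelled optional element is silently treated as unknown and its default is used, which is the price of the forward compatibility this design aims for.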

Ilpo said:
At least the interface looks quite simple and usable. With some
validation wrapping over it, it might be ok...

I was going to point you to a validating parser for ET, but the "it might
be ok" statement is a bit too arrogant for my taste.

</F>
 
Ville Vainio

"Ilpo" == Ilpo Nyyssönen <iny> writes:

Ilpo> Pickle doesn't have validation. I am not comfortable for
Ilpo> using it as storage format that should be reliable over
Ilpo> years when the program evolves. It also doesn't tell me if

That's why you should implement xml import/export mechanism and use
the xml file as the "canonical" data, while the pickle is only a cache
for the data.
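
One way to read this suggestion is the sketch below: the XML file stays the source of truth, and a pickle next to it is used only when it is at least as new as the XML. The XML layout (`event` elements with a `date` attribute) and the function names are assumptions for illustration only.

```python
import os
import pickle
import xml.etree.ElementTree as ET


def parse_events(xml_path):
    """Parse the canonical XML file into plain dicts (hypothetical layout)."""
    root = ET.parse(xml_path).getroot()
    return [
        {"date": ev.get("date"), "title": (ev.text or "").strip()}
        for ev in root.iter("event")
    ]


def load_events(xml_path, cache_path):
    """Return the events, trusting the pickle only if it is not older than the XML."""
    try:
        if os.path.getmtime(cache_path) >= os.path.getmtime(xml_path):
            with open(cache_path, "rb") as fh:
                return pickle.load(fh)
    except (OSError, pickle.UnpicklingError, EOFError):
        pass  # missing or damaged cache: fall back to the XML below
    events = parse_events(xml_path)
    with open(cache_path, "wb") as fh:
        pickle.dump(events, fh)  # refresh the cache for the next invocation
    return events
```

Because the cache can always be rebuilt from the XML, deleting or losing the pickle file costs one slow startup and nothing else.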

Ilpo> How can it work automatically in separate module? Replacing
Ilpo> the re.compile with something sounds possible way of getting
Ilpo> the regexps, but how and where to store the compiled data?
Ilpo> Is there a way to put it to the byte code file?

Do what you already did - dump the regexp cache to a separate file.
 
Ilpo Nyyssönen

Fredrik Lundh said:
DTD validation is useful in two cases:
[...]

I didn't mention DTD validation. Yes, I know the limitations of DTD
validation. DTD validation gives a clear error message with line
number in case it doesn't match.

Show me this:

- An object oriented storage library
- A flat thing is not enough, needs some hierarchy, like in XML
- Validation that also converts the data to pythonic types, like
  numbers to ints or data of my objects to my objects
- Includes a way to define version migration steps
- Parsing and validation must be fast
- Storage format preferably a readable text file
- Easy to use in application

These things would be nice to have in it too:

- Multiple backends, at least text file, XML and SQL database
- Some kind of synchronization or replication utilities

Fredrik Lundh said:
But DTD validation doesn't tell you that either -- it's only concerned
with the structure, not the content.
[...]

Pickle doesn't even have that. Also I can't read a pickle file without
writing some program to dump it in a readable format. So, I can't use
validation to make sure the data in pickle is the one I want and I
can't use less to see what is in the file. I really have NO IDEA what
is in a pickle file.

Or, yes, I clearly would need to build the validation myself on top of
it! Not going to happen.

Fredrik Lundh said:
If you want the simplest thing, get rid of the DTD, and make your
loader ignore things that it doesn't recognize, use default values for
fields that are not required (or weren't in the format from the start),
and give a nice readable error message if something required is
missing. That'll give you a nice, portable, reliable, and extremely
future-proof design.

So my program will just work in the wrong way if I make a typo to a
non-required field when writing the file? No thanks.
 
Ilpo Nyyssönen

Ville Vainio said:
Ilpo> Pickle doesn't have validation. I am not comfortable for
Ilpo> using it as storage format that should be reliable over
Ilpo> years when the program evolves. It also doesn't tell me if

That's why you should implement xml import/export mechanism and use
the xml file as the "canonical" data, while the pickle is only a cache
for the data.

Would make the program too complex, unless it is done by a library. I
actually prefer saving only once and doing that in a fast, reliable way.

Ville Vainio said:
Ilpo> How can it work automatically in separate module? Replacing
Ilpo> the re.compile with something sounds possible way of getting
Ilpo> the regexps, but how and where to store the compiled data?
Ilpo> Is there a way to put it to the byte code file?

Do what you already did - dump the regexp cache to a separate file.

That didn't get all of the regexps. It only got the regexps that were
loaded at the time I dumped the cache.
 
