M
Magnus Lie Hetland
I'm working on a project (Atox) where I need to match quite a few
regular expressions (several hundred) in reasonably large text files.
I've found that this can easily get rather slow. (There are many
things that slow Atox down -- it hasn't been designed for speed, and
any optimizations will entail quite a bit of refactoring.)
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location. I haven't yet looked at the performance of the code
for checking this, because I encountered a problem before that: Using
named groups will only work for 100 patterns -- not a terrible
problem, since I can create several 100-group patterns -- and using
named groups slows down the matching *a lot*. As far as I could tell,
using named groups actually was slower than simply matching the
patterns one by one.
So: What can I do? Is there any way of getting more speed here, except
implementing the matching code (i.e. the code right around the calls
to _re) in C or Pyrex? (I've tried using Psyco, but that didn't help;
I guess it might help if I implemented things differently...)
Any ideas?
regular expressions (several hundred) in reasonably large text files.
I've found that this can easily get rather slow. (There are many
things that slow Atox down -- it hasn't been designed for speed, and
any optimizations will entail quite a bit of refactoring.)
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location. I haven't yet looked at the performance of the code
for checking this, because I encountered a problem before that: Using
named groups will only work for 100 patterns -- not a terrible
problem, since I can create several 100-group patterns -- and using
named groups slows down the matching *a lot*. As far as I could tell,
using named groups actually was slower than simply matching the
patterns one by one.
So: What can I do? Is there any way of getting more speed here, except
implementing the matching code (i.e. the code right around the calls
to _re) in C or Pyrex? (I've tried using Psyco, but that didn't help;
I guess it might help if I implemented things differently...)
Any ideas?