Regexp optimization question


Magnus Lie Hetland

I'm working on a project (Atox) where I need to match quite a few
regular expressions (several hundred) in reasonably large text files.
I've found that this can easily get rather slow. (There are many
things that slow Atox down -- it hasn't been designed for speed, and
any optimizations will entail quite a bit of refactoring.)

I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location. I haven't yet looked at the performance of the code
for checking this, because I encountered a problem before that: Using
named groups will only work for 100 patterns -- not a terrible
problem, since I can create several 100-group patterns -- and using
named groups slows down the matching *a lot*. As far as I could tell,
using named groups actually was slower than simply matching the
patterns one by one.
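
For the record, here is a minimal sketch of the chunking workaround mentioned
above; the pattern list and the group names are made up, and the limit of
roughly 100 groups per compiled pattern is the one described in the post:

import re

pats = ['foo%d' % i for i in range(1, 501)]   # placeholder patterns

CHUNK = 99   # stay under the per-pattern group limit
combined = []
for start in range(0, len(pats), CHUNK):
    chunk = pats[start:start + CHUNK]
    big = '|'.join(['(?P<p%d>%s)' % (start + i, p)
                    for i, p in enumerate(chunk)])
    combined.append(re.compile(big))

# For a match m from any of these, m.lastgroup names the alternative
# that matched (e.g. 'p42').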

So: What can I do? Is there any way of getting more speed here, except
implementing the matching code (i.e. the code right around the calls
to _re) in C or Pyrex? (I've tried using Psyco, but that didn't help;
I guess it might help if I implemented things differently...)

Any ideas?
 

Martin v. Löwis

Magnus said:
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location.

Are you using the .lastindex attribute of match objects yet?

Martin
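
A minimal illustration of the lastindex idea (the alternation here is
invented):

>>> import re
>>> rx = re.compile('(foo)|(bar[0-9]+)|(baz)')
>>> m = rx.search('xx bar42 yy')
>>> m.lastindex, m.group(m.lastindex)
(2, 'bar42')

lastindex is the 1-based number of the last group that took part in the
match, so for a plain alternation of groups it identifies the winning
alternative.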
 

Günter Jantzen

Magnus Lie Hetland said:
Any ideas?

Maybe Plex is helpful. I haven't used it myself yet, but it seems to address
your problem. The author of Plex is Greg Ewing; he built Pyrex on top of Plex.

The documentation
http://www.cosc.canterbury.ac.nz/~greg/python/Plex/version/doc/index.html
contains

"""
Plex is designed to fill a need that is left wanting by the existing Python
regular expression modules. If you've ever tried to use one of them for
implementing a scanner, you will have found that they're not really suited
to the task. You can define a bunch of regular expressions which match your
tokens all right, but you can only match one of them at a time against your
input. To match all of them at once, you have to join them all together into
one big r.e., but then you've got no easy way to tell which one matched.
This is the problem that Plex is designed to solve.
"""

Hope I could help you
Guenter
 

Magnus Lie Hetland

Are you using the .lastindex attribute of match objects yet?

Dang! I thought I had read the docs ;)

Seems very useful. I actually need to know all the matches, because
which one gets selected depends on the current priorities (ordering)
in the parser, and cannot be fixed for the scanner. (I could put the
patterns in the or-group in reverse prioritized order, but that would
have to be fixed...)

Thanks for the tip, anyway.
 

Magnus Lie Hetland

A few concrete examples, perhaps? Sadly, my telepathetic power is not
what it used to be...

Well... Hard to give specific examples, as the specifics will be
user-specified.

But in order to explain what I'm doing, I could give a rather generic
example.

I might have a bunch of regexps like 'foo1', 'foo2', 'foo3', ...,
'foo500' (these would be more complex, of course).

Now I can do something like this:

hits = []
for pat in pats:
    hits.extend(pat.finditer(text))
# Maybe sort them in order of occurrence

*Or* I can do something like this:

bigPat = re.compile('(' + '|'.join(pats) + ')')
hits = list(bigPat.finditer(text))

The last one is *much* faster -- but only if I forego named groups.
And without them, I can't find out which patterns matched at which
locations. (I can, as suggested, use .lastindex to find *one* of them,
but not all, as I need.)

I'm sure there are plenty of bottlenecks in my code, but this seems to
be one of them, and it's slow even if I run it alone (without any of
the rest of my parsing code).
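
One possible way to keep the fast combined scan and still recover *all* the
patterns matching at each hit -- a sketch only, with pats and text as
placeholders -- is to use the big pattern just to find candidate offsets and
then re-test the individual compiled patterns anchored there:

import re

compiled = [re.compile(p) for p in pats]
big = re.compile('|'.join(['(?:%s)' % p for p in pats]))

hits = []
for m in big.finditer(text):
    pos = m.start()
    # every individual pattern that also matches at this offset
    matching = [p for p in compiled if p.match(text, pos)]
    hits.append((pos, matching))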
 

Magnus Lie Hetland

Maybe Plex is helpful. I haven't used it myself yet, but it seems to address
your problem.
Ahah.

The author of Plex is Greg Ewing; he built Pyrex on top of Plex.

Yeah, I know -- I just never looked at Plex in detail.
[snip]

Yeah, I looked at the docs, and it looks very promising!

One of the reasons I've avoided existing lexers is that I don't do
standard tokenization -- I don't partition all of the text into regexp
tokens. I allow the lexer to skip over text -- somewhat like how
whitespace is normally handled, except that this can be *any* text --
and to return the next token that is of any use to the current parsing
rule.

But who knows -- I may be able to use Plex anyway.

One problem might be that the regexp format seems to be quite
stripped-down (although, of course, a regexp is a regexp,
theoretically speaking ;)

No non-greedy matching, no lookahead/lookbehind, etc.

But if Plex gives a real performance boost, I may be able to live with
that. (The regexp part is functionality that is available to the user,
in my case.)
Hope I could help you

It might. Thanks.
 

Magnus Lie Hetland

the techniques discussed in this article may be helpful:

http://effbot.org/zone/xml-scanner.htm

Yes, this is indeed useful. The most useful thing, I guess, is the
lastindex attribute (which was pointed out earlier) -- but I suspect I
can't use it, as I have to know all the patterns that were matched at
a given location (and let the parser choose which one to use, based
on its context-dependent priorities). Or, at least, that's how it
seems to me now.

I *could*, of course, say that ambiguity is not allowed -- or that the
effects aren't defined. Or maybe I could order the patterns in terms
of specificity (but analyzing regexps for specificity is no mean feat,
of course ;)

Thanks for the pointer, though -- a nice tutorial!
 

William Park

Magnus Lie Hetland said:
Now I can do something like this:

hits = []
for pat in pats:
    hits.extend(pat.finditer(text))
# Maybe sort them in order of occurrence

*Or* I can do something like this:

bigPat = re.compile('(' + '|'.join(pats) + ')')
hits = list(bigPat.finditer(text))

The last one is *much* faster -- but only if I forego named groups.
And without them, I can't find out which patterns matched at which
locations. (I can, as suggested, use .lastindex to find *one* of them,
but not all, as I need.)

Since you want both the matched strings and their locations in the file, you
pretty much have to do this manually, one by one.
 

Magnus Lie Hetland

William Park wrote: said:
Since you want both the matched strings and their locations in the file, you
pretty much have to do this manually, one by one.

Well -- that's more or less what I'm doing. (Or -- I can get the match
objects all at once, of course, using finditer.)

I guess I'll have to look elsewhere for performance improvements. Hm.
 

William Park

Magnus Lie Hetland said:
Well -- that's more or less what I'm doing. (Or -- I can get the match
objects all at once, of course, using finditer.)

I guess I'll have to look elsewhere for performance improvements. Hm.

You can write up something in C, as a standalone tool or as a patch to the
Bash shell. Essentially:
- locate the start of a string match
- print the location (and the string)
- move the pointer past the end of the string
- repeat

The closest you can get with standard tools is the matched strings and the
byte offsets of the lines they occur on.
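
For comparison, the same locate/record/advance loop is easy to sketch in
Python with re (the pattern and the sample text here are made up):

import re

text = 'asdad foobaz rhubarb pie'
rx = re.compile('foo|bar|baz')

hits = []
pos = 0
while 1:
    m = rx.search(text, pos)
    if not m:
        break
    hits.append((m.start(), m.group()))   # location and matched string
    pos = max(m.end(), pos + 1)           # move past the match; max() guards
                                          # against zero-width matches

# hits -> [(6, 'foo'), (9, 'baz'), (16, 'bar')]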
 

Paul McGuire

One of the reasons I've avoided existing lexers is that I don't do
standard tokenization -- I don't partition all of the text into regexp
tokens. I allow the lexer to skip over text -- somewhat like how
whitespace is normally handled, except that this can be *any* text --
and to return the next token that is of any use to the current parsing
rule.

pyparsing supports this kind of text skipping, using scanString() instead of
parseString(). scanString() is actually a generator, yielding for each
match a tuple consisting of:
- matched tokens (returned as a ParseResults object - sort of super-list,
supporting simple list semantics, but some tokens can be named and accessed
as in a dictionary or as attributes)
- start location in source string
- end location in source string

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul
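
A small sketch of what that looks like; the mini-grammar here is invented,
not from Paul's post:

from pyparsing import Literal, Word, nums

token = Literal("foo") + Word(nums)

hits = []
for tokens, start, end in token.scanString("xx foo1 yy foo42 zz"):
    # tokens is a ParseResults; start and end are offsets into the source
    hits.append((list(tokens), start, end))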
 

Magnus Lie Hetland

Paul McGuire said:
pyparsing supports this kind of text skipping, using scanString()
instead of parseString().

I already have a working implementation in Python -- if this isn't
more efficient (I'm just talking about the tokenization part) I don't
think there would be much gain in switching.

(IIRC, the pyparsing docs say that pyparsing is slow for complex
grammars, at least.)

BTW: I have now done some experiments with Plex with lots of regular
expressions; simply compiling a pattern with 500 alternatives took
forever, whereas re.compile was quite fast.

So... If I can somehow be content with only getting one match per
position, I guess re is the best solution.

Or I could implement something in C (Pyrex)... (Or use something like
pybison.)
 

Jeremy Fincher

One problem might be that the regexp format seems to be quite
stripped-down (although, of course, a regexp is a regexp,
theoretically speaking ;)

No non-greedy matching, no lookahead/lookbehind, etc.

It's important to note that many of these features turn the
expressions that represent their use into decidedly non-regular
expressions, regardless of what Pythoneers and Perlers call them :)

Jeremy
 

Greg Ewing

Magnus said:
One of the reasons I've avoided existing lexers is that I don't do
standard tokenization -- I don't partition all of the text into regexp
tokens. I allow the lexer to skip over text -- somewhat like how
whitespace is normally handled, except that this can be *any* text --
and to return the next token that is of any use to the current parsing
rule.

But who knows -- I may be able to use Plex anyway.

You should be able to do this with Plex by defining a
fall-back rule that matches any single character and
ignores it.

If you make it the last rule, and make sure it only
matches one character, any other rule that also matches
will take precedence over it.

If you need to match different sets of tokens in
different circumstances, you can do this with states,
and have the parser switch states according to what
it's currently doing.
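
A rough sketch of that fall-back idea, assuming Plex's documented
Lexicon/Scanner interface (the token patterns here are invented):

from Plex import Lexicon, Scanner, Str, AnyChar, IGNORE, TEXT
from StringIO import StringIO

lexicon = Lexicon([
    (Str("foo1"), TEXT),      # the "real" token patterns
    (Str("foo2"), TEXT),
    (AnyChar,     IGNORE),    # last rule: silently skip any other character
])

scanner = Scanner(lexicon, StringIO("xx foo1 yy foo2"), "<string>")
while 1:
    value, text = scanner.read()
    if value is None:         # end of input
        break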
 

Greg Ewing

Magnus said:
BTW: I have now done some experiments with Plex with lots of regular
expressions; simply compiling a pattern with 500 alternatives took
forever, whereas re.compile was quite fast.

Yeow, that's really thrashing Plex to death. Did you wait
for it to finish? I'm curious how long it actually took.

Plex is really designed for parsing programming languages
and the like, where the number of patterns is small enough
to be written by hand, and some common sense used in the
process. For example, if you have a lot of patterns with a
regularity to them such as "foo1", "foo2", ..., "foo500"
it will be much faster to compile a pattern such as

Str("foo") + (Range("19") | (Range("09") + Range("09"))
| (Range("14") + Range("09") + Range("09")) | Str("500"))

than to compile a huge expression with 500 alternatives.
The resulting state machine will likely be a lot smaller,
too.

On the plus side, scanning will be just as fast however
you write the REs, which is not true for most RE
implementations of the Perl variety.

One more tip: If you're careful what sort of things you
use for actions, you can pickle the Lexicon, so that it
only needs to be generated once rather than each time the
program starts up. I'm currently doing this in Pyrex, and
it works quite well. Let me know if you want any more info
about doing this.
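
A rough sketch of that caching idea; build_lexicon and the file name are
hypothetical, and the actions stored in the Lexicon have to be picklable:

import os, pickle

LEXFILE = "lexicon.pickle"

if os.path.exists(LEXFILE):
    lexicon = pickle.load(open(LEXFILE, "rb"))
else:
    lexicon = build_lexicon()   # hypothetical: the slow Lexicon(...) construction
    pickle.dump(lexicon, open(LEXFILE, "wb"), pickle.HIGHEST_PROTOCOL)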
 

Greg Ewing

Jeremy said:
It's important to note that many of these features turn the
expressions that represent their use into decidedly non-regular
expressions, regardless of what Pythoneers and Perlers call them :)

Indeed, which is one reason Plex doesn't do them. Plex
converts all the REs into a big DFA, which means that
Plex REs must truly be regular.

This is the price you pay for scanning time independent
of the complexity of the REs.
 

Fredrik Lundh

Jeremy said:
It's important to note that many of these features turn the
expressions that represent their use into decidedly non-regular
expressions, regardless of what Pythoneers and Perlers call them :)

the use of "regular" in this context refers to religion, not computer
science.

</F>
 

Bengt Richter

Well -- that's more or less what I'm doing. (Or -- I can get the match
objects all at once, of course, using finditer.)

I guess I'll have to look elsewhere for performance improvements. Hm.

Have you tried something like this?
>>> s = 'asdad foobaz rhubarb pie'
>>> import re
>>> rxo = re.compile(r'(foo|bar|baz)')
>>> rxo.split(s)[1::2]
['foo', 'baz', 'bar']

Regards,
Bengt Richter
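
Continuing Bengt's session, the offsets can be recovered as well, since
split() with a capturing group also returns the text between the matches
(this is an added sketch, not part of the original post):

>>> pieces = rxo.split(s)
>>> pos, hits = 0, []
>>> for i, piece in enumerate(pieces):
...     if i % 2:                 # odd indices are the captured matches
...         hits.append((pos, piece))
...     pos += len(piece)
...
>>> hits
[(6, 'foo'), (9, 'baz'), (16, 'bar')]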
 
