Speeding up multiple regex matches

T

Talin

I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

2) Use the first character of the text string to cut down the search
size. The problem here is that since I don't know the regex patterns in
advance, I would need to inspect each regex and determine the possible
set of starting characters that could be matched. This would require
creating my own regex parser.

3) Use a lexical scannner. This is probably overkill for what I want to
do.

4) Write my own regex system that does what I want. This is more work
than I want to do.

Any thoughts?
 
A

Alex Martelli

Talin said:
1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:
('aa', None)

There's a limit of 99 groups, so if you have unbounded number of regexes
to start with you'll have to split them up 99-or-fewer at a time, but
that shouldn't be impossibly hard.


Alex
 
F

Fredrik Lundh

Talin said:
I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

use a capturing group for each alternative, and use lastindex to quickly
find the match:

http://docs.python.org/lib/match-objects.html

lastindex

The integer index of the last matched capturing group, or None if
no group was matched at all.

also see:

http://effbot.org/zone/xml-scanner.htm

</F>
 
T

Talin

OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )

# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( "\s+" )
scanner.rule( "/\*", reflex.set_state( State.Comment ) )
scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
"(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
token=TokenType.Real )

# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( "\*/", reflex.set_state( State.Default ) )
scanner.rule( "(?:[^*]|\*(?!/))+" )

# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
print token

Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are marched
vs. this regex, and if a match succeeds, then the match object's
'lastindenx' property is used to look up the actions to perform in the
array.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,046
Latest member
Gavizuho

Latest Threads

Top