Speeding up multiple regex matches

Talin · Nov 18, 2005

I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

2) Use the first character of the text string to cut down the search
size. The problem here is that since I don't know the regex patterns in
advance, I would need to inspect each regex and determine the possible
set of starting characters that could be matched. This would require
creating my own regex parser.

3) Use a lexical scannner. This is probably overkill for what I want to
do.

4) Write my own regex system that does what I want. This is more work
than I want to do.

Any thoughts?

Alex Martelli · Nov 18, 2005

Talin said:
1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:
('aa', None)

There's a limit of 99 groups, so if you have unbounded number of regexes
to start with you'll have to split them up 99-or-fewer at a time, but
that shouldn't be impossibly hard.

Alex

Fredrik Lundh · Nov 18, 2005

Talin said:
I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

use a capturing group for each alternative, and use lastindex to quickly
find the match:

http://docs.python.org/lib/match-objects.html

lastindex

The integer index of the last matched capturing group, or None if
no group was matched at all.

also see:

http://effbot.org/zone/xml-scanner.htm

</F>

Talin · Nov 19, 2005

OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )

# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( "\s+" )
scanner.rule( "/\*", reflex.set_state( State.Comment ) )
scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
"(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
token=TokenType.Real )

# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( "\*/", reflex.set_state( State.Default ) )
scanner.rule( "(?:[^*]|\*(?!/))+" )

# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
print token

Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are marched
vs. this regex, and if a match succeeds, then the match object's
'lastindenx' property is used to look up the actions to perform in the
array.

RegEx	0	Sep 1, 2022
Finding all regex matches by index?	1	May 30, 2012
Regex replace problem	2	Jan 6, 2022
Speeding up Python's exit	12	Feb 28, 2013
python regex character group matches	2	Sep 17, 2008
On re / regex replacement	3	Aug 28, 2011
Python pyPDF4 code to bookmark pdf based upon date text	1	Jan 18, 2023
returning regex matches as lists	7	Feb 15, 2008

Speeding up multiple regex matches

Talin

Alex Martelli

Fredrik Lundh

Talin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads