Efficient String Lookup?


Helmut Jarausch

Chris said:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g. 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?

A very flexible and fast tool is mxTextTools:
http://www.egenix.com/files/python/mxTextTools.html
See also
http://simpleparse.sourceforge.net
which contains a more recent (non-recursive) version of the pattern-matching
machine.
It's faster than regexps (at least sometimes) but much more flexible.
Each pattern is stored as (high-level) "assembler" for a pattern-matching
machine written in C. A pattern is just a Python tuple, which can be
stored (pickled).
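
As a rough illustration, a tag table for the 'abc#e#' pattern might look like
the sketch below. The command names (Word, IsIn, EOF, Here) and the tag()
call are taken from my reading of the mx.TextTools documentation, so treat
the details as assumptions and check them against the release you actually use.

# Sketch of an mxTextTools tag table for 'abc#e#' (assumed API, see above).
from mx.TextTools import tag, Word, IsIn, EOF, Here

ANY_CHAR = ''.join(map(chr, range(256)))   # character class for the '#' wildcard

pattern = (                     # a pattern is just a Python tuple ...
    (None, Word, 'abc'),        # literal 'abc'
    (None, IsIn, ANY_CHAR),     # '#': exactly one arbitrary character
    (None, Word, 'e'),          # literal 'e'
    (None, IsIn, ANY_CHAR),     # '#'
    (None, EOF, Here),          # succeed only at the end of the string
)                               # ... so it can be pickled and stored

ok, taglist, next_pos = tag('abcdef', pattern)
print(ok)                       # 1 on a successful match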

--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
 

Win Carus

Chris S. said:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g. 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?
Regular expressions are probably not the simplest or most
straightforward option here.

A simpler and more straightforward approach is to (a) pre-compute
all the possible strings (i.e., interpolate all possible wildcard
values) along with their associated values, and (b) use a dictionary.
Pre-computation in your case is simple because you only have
single-character wildcards. This has four benefits: it uses a standard
and highly optimized native Python data structure (the dictionary);
it's very fast; it's very simple to update or modify keys and values
("cargo"); and it identifies pattern collisions (i.e., patterns that
have two or more actions associated with them).
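
A minimal sketch of that idea, assuming a lowercase alphabet for '#' and
made-up patterns and cargo purely for illustration:

from itertools import product
import string

ALPHABET = string.ascii_lowercase          # whatever characters '#' may stand for

def expand(pattern, alphabet=ALPHABET):
    """Yield every concrete string the wildcard pattern can match."""
    choices = [alphabet if c == '#' else c for c in pattern]
    for combo in product(*choices):
        yield ''.join(combo)

patterns = {'abc#e#': 'action-1', 'xyz#': 'action-2'}   # pattern -> cargo (illustrative)

lookup = {}
for pat, cargo in patterns.items():
    for concrete in expand(pat):
        # a collision is two patterns claiming the same concrete string
        if concrete in lookup and lookup[concrete] != cargo:
            raise ValueError('pattern collision on %r' % concrete)
        lookup[concrete] = cargo

print(lookup.get('abcdef'))                # -> 'action-1'

For 'abc#e#' over a 26-letter alphabet that is 26 * 26 = 676 keys, so the
approach stays cheap as long as each pattern contains only a few wildcards.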

On the other hand, if the number of pre-computed strings is "too
large" and there is a lot of prefix and/or suffix similarity in the
patterns, you could consider using an automaton, a trie, or a DAG
(directed acyclic graph). Any string-matching automaton or trie will
do automatic prefix pattern compression and will allow you to attach
arbitrary data to the matching state. If there is considerable suffix
similarity, you could also consider a DAG since it ties together
similar suffixes with identical cargo.
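
For the trie variant, here is a hand-rolled sketch using dict-of-dicts nodes
with an explicit '#' edge; the node layout and helper names are my own, not
any particular library's:

_END = object()  # reserved key marking "a pattern ends here", holding its cargo

def insert(root, pattern, cargo):
    node = root
    for ch in pattern:
        node = node.setdefault(ch, {})   # each node is a dict: char -> child node
    node[_END] = cargo

def match(node, text, i=0):
    """Return the cargo of a pattern matching all of text (literal edges
    tried before '#'), or None if nothing matches."""
    if i == len(text):
        return node.get(_END)
    for edge in (text[i], '#'):          # try the literal edge, then the wildcard
        child = node.get(edge)
        if child is not None:
            found = match(child, text, i + 1)
            if found is not None:
                return found
    return None

root = {}
insert(root, 'abc#e#', 'action-1')
insert(root, 'abdde#', 'action-2')
print(match(root, 'abcdef'))             # -> 'action-1'

Shared prefixes such as the 'ab' in the two example patterns are stored only
once, which is the prefix compression mentioned above, and the cargo hangs
directly off the matching state.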

BTW: There is no reason to use an Aho-Corasick automaton for this
problem since you don't need its
find-all-matches-in-a-given-sequence-in-a-single-pass functionality
(and that functionality adds substantial complexity to automaton
generation).
 
