Efficient String Lookup?


Helmut Jarausch

Chris said:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g. 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?

A very flexible and fast tool is mxTextTools:
http://www.egenix.com/files/python/mxTextTools.html
See also
http://simpleparse.sourceforge.net
which contains a more recent (non-recursive) version of the pattern-matching
machine.
It's faster than regexps (at least sometimes) but much more flexible.
Each pattern is stored as (high-level) "assembler" for a pattern-matching
machine written in C. A pattern is just a Python tuple, which can be
stored (pickled).
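
As a rough illustration, a tag table for the 'abc#e#' pattern might look like
the sketch below. The command names (Word, IsIn, EOF, Here) and the tag()
call are taken from my reading of the mx.TextTools documentation, so treat
the details as assumptions and check them against the release you actually use.

# Sketch of an mxTextTools tag table for 'abc#e#' (assumed API, see above).
from mx.TextTools import tag, Word, IsIn, EOF, Here

ANY_CHAR = ''.join(map(chr, range(256)))   # character class for the '#' wildcard

pattern = (                     # a pattern is just a Python tuple ...
    (None, Word, 'abc'),        # literal 'abc'
    (None, IsIn, ANY_CHAR),     # '#': exactly one arbitrary character
    (None, Word, 'e'),          # literal 'e'
    (None, IsIn, ANY_CHAR),     # '#'
    (None, EOF, Here),          # succeed only at the end of the string
)                               # ... so it can be pickled and stored

ok, taglist, next_pos = tag('abcdef', pattern)
print(ok)                       # 1 on a successful match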

--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
 

Win Carus

Chris S. said:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g. 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?
Regular expressions are probably not the simplest or most
straightforward option here.

A simpler and more straightforward approach is to (a) pre-compute
all the possible strings (i.e., interpolate all possible wildcard
values) along with their associated values, and (b) use a dictionary.
Pre-computation in your case is simple because you only have
single-character wildcards. This has four benefits: it uses a standard
and highly optimized native Python data structure (the dictionary);
it's very fast; it's very simple to update or modify keys and values
("cargo"); and it identifies pattern collisions (i.e., patterns that
have two or more actions associated with them).
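
A minimal sketch of that idea, assuming a lowercase alphabet for '#' and
made-up patterns and cargo purely for illustration:

from itertools import product
import string

ALPHABET = string.ascii_lowercase          # whatever characters '#' may stand for

def expand(pattern, alphabet=ALPHABET):
    """Yield every concrete string the wildcard pattern can match."""
    choices = [alphabet if c == '#' else c for c in pattern]
    for combo in product(*choices):
        yield ''.join(combo)

patterns = {'abc#e#': 'action-1', 'xyz#': 'action-2'}   # pattern -> cargo (illustrative)

lookup = {}
for pat, cargo in patterns.items():
    for concrete in expand(pat):
        # a collision is two patterns claiming the same concrete string
        if concrete in lookup and lookup[concrete] != cargo:
            raise ValueError('pattern collision on %r' % concrete)
        lookup[concrete] = cargo

print(lookup.get('abcdef'))                # -> 'action-1'

For 'abc#e#' over a 26-letter alphabet that is 26 * 26 = 676 keys, so the
approach stays cheap as long as each pattern contains only a few wildcards.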

On the other hand, if the number of pre-computed strings is "too
large" and there is a lot of prefix and/or suffix similarity in the
patterns, you could consider using an automaton, a trie, or a DAG
(directed acyclic graph). Any string-matching automaton or trie will
do automatic prefix pattern compression and will allow you to attach
arbitrary data to the matching state. If there is considerable suffix
similarity, you could also consider a DAG since it ties together
similar suffixes with identical cargo.
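
For the trie variant, here is a hand-rolled sketch using dict-of-dicts nodes
with an explicit '#' edge; the node layout and helper names are my own, not
any particular library's:

_END = object()  # reserved key marking "a pattern ends here", holding its cargo

def insert(root, pattern, cargo):
    node = root
    for ch in pattern:
        node = node.setdefault(ch, {})   # each node is a dict: char -> child node
    node[_END] = cargo

def match(node, text, i=0):
    """Return the cargo of a pattern matching all of text (literal edges
    tried before '#'), or None if nothing matches."""
    if i == len(text):
        return node.get(_END)
    for edge in (text[i], '#'):          # try the literal edge, then the wildcard
        child = node.get(edge)
        if child is not None:
            found = match(child, text, i + 1)
            if found is not None:
                return found
    return None

root = {}
insert(root, 'abc#e#', 'action-1')
insert(root, 'abdde#', 'action-2')
print(match(root, 'abcdef'))             # -> 'action-1'

Shared prefixes such as the 'ab' in the two example patterns are stored only
once, which is the prefix compression mentioned above, and the cargo hangs
directly off the matching state.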

BTW: There is no reason to use an Aho-Corasick automaton for this
problem since you don't need its
find-all-matches-in-a-given-sequence-in-a-single-pass functionality
(and that functionality adds substantial complexity to automaton
generation).
 
