Does my RE rock, or suck?


Bryan Olson

Here's my algorithmic problem:

Given a set of 'trigger' strings, and one 'path' string,
find the longest of the triggers that is a prefix of the
path.

The trigger strings may number in the dozens or hundreds, but
the set is more-or-less static; I'll be matching many paths
against the same set of triggers. Matching a path should be
fast in the average case, and non-awful even if an adversary
designed the path to be evil. This is for a web-server-thing
that registers handlers for paths that begin with certain
strings, like most of the cool web servers do.
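
(Just to pin down the spec, here's a brute-force version; the
function name is made up:)

def longest_trigger(triggers, path):
    # Check every trigger against the path; fine as a spec,
    # too slow to run on every request.
    best = ''
    for t in triggers:
        if path.startswith(t) and len(t) > len(best):
            best = t
    return best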

I've written a function that takes a list of trigger strings,
and outputs a Python regular expression whose search() should
find the longest matching trigger. For example, given:

triggers = ["Alice", "Bob" "John", "John Smith",
"John Smith Jr", "John Smith Sr"]

My re-generator returns:

re.compile('^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)')

Some RE details: I actually use '(?:' instead of just '(', but
no need to worry about that now. There are never *'s or +'s or
other loop-like things in the generated RE's. The RE's list
choices in reverse-lexicographic order; as the doc says, "REs
separated by "|" are tried from left to right." The order
should ensure finding the longest matching trigger. In a
couple minutes of testing, it seems to work.
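
For instance, a quick check (a throwaway sketch, not my generator)
that alternation order decides which branch wins:

import re

# With the longer alternative first, the longer prefix matches.
first = re.compile('^(?:John Smith|John)')
last = re.compile('^(?:John|John Smith)')
print first.search('John Smith Jr').group(0)    # John Smith
print last.search('John Smith Jr').group(0)     # John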


I've been burned using Python's RE module in the past. Before I
sink more time into this, I was wondering if someone could tell
me if the technique makes sense. In particular:

Is this likely to be fast in normal cases? Is there a
clearly better approach?

What would be the worst-case time?

Am I likely to exceed recursion depth?

Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?


Thanks for any help answering these.
 

Michael Hudson

Bryan Olson said:
> Here's my algorithmic problem:
>
> Given a set of 'trigger' strings, and one 'path' string,
> find the longest of the triggers that is a prefix of the
> path.
[...]
> Is this likely to be fast in normal cases? Is there a
> clearly better approach?

Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
"Dozens to hundreds" sounds like a fairly small number in context,
but: time it! Only you know what fast enough is.
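
Something like this (pattern and path made up) would do for a first
timing pass:

import timeit

setup = """
import re
rx = re.compile('^(?:Mary|John( Smith( (Sr|Jr)|)|)|Bob)')
"""
# Time 100000 searches of one plausible path against the pattern.
t = timeit.Timer("rx.search('John Smith Sr and more path')", setup)
print t.timeit(100000)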

Here's a cute approach that should have running time independent of
the number of triggers:

def build_dict(prefix_list):
    # Build a trie of nested dicts; a '' key marks the end of a prefix.
    d = {}
    for p in prefix_list:
        if len(p) > 0:
            d.setdefault(p[0], []).append(p[1:])
    r = {}
    for p in d:
        r[p] = build_dict(d[p])
        if '' in d[p]:
            r[p][''] = 1
    return r

def longest_prefix(s, d):
    # Walk the trie one character at a time, remembering the last
    # position where a complete prefix ended.
    i = 0
    j = 0
    for c in s:
        if c in d:
            d = d[c]
            i += 1
            if '' in d:
                j = i
        else:
            return s[:j]
    else:
        return s[:j]

if __name__ == '__main__':
    d = build_dict(["Alice", "Bob", "John", "John Smith",
                    "John Smith Jr", "John Smith Sr"])
    print longest_prefix("John Smith", d)
    print longest_prefix("John Smith", d)

Here's a less cute but probably saner approach:

def prepare(prefix_list):
    # Group the prefixes by length; return the grouping and the
    # list of lengths, longest first.
    d = {}
    for p in prefix_list:
        d.setdefault(len(p), {})[p] = p
    l = d.keys()
    l.sort()
    l.reverse()
    return d, l

def longest_prefix2(s, d, l):
    # Try each prefix length, longest first.
    for i in l:
        r = d[i].get(s[:i])
        if r is not None:
            return r
    return ''

if __name__ == '__main__':
    d, l = prepare(["Alice", "Bob", "John", "John Smith",
                    "John Smith Jr", "John Smith Sr"])
    print longest_prefix2("John Smith", d, l)
    print longest_prefix2("John Smith", d, l)

this should run in time proportional to the number of different
lengths of trigger (and use rather less memory than my first version).

> What would be the worst-case time?
>
> Am I likely to exceed recursion depth?

Not in 2.4 :) And I doubt it before that, actually.

> Any other gotcha? Am I likely to find it when I compile the
> RE, or when I try to match a path?
>
> Thanks for any help answering these.

This thread from a year or so back might interest:

http://groups.google.com/[email protected]

Cheers,
mwh
 

Bryan Olson

Thanks Michael. I still have more investigating to do.

Michael said:
> Well, the re module is an NFA engine, not a DFA engine, so I'd expect
> this re to run in time proportional to the number of triggers.

I guess I didn't point this out except for my example: I tried
to make the RE efficient even for a search-and-backtrack engine.
Not only are there no loop-like things, there are never multiple
choices that begin with the same character. Any time it
backtracks, it will fail immediately at every choice point.
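
To illustrate (a contrived sketch, not my generator's output):

import re

# Unfactored: the alternatives share the prefix 'John Smith ', so a
# backtracking engine may re-scan those characters once per failed
# alternative.
unfactored = re.compile('^(?:John Smith Sr|John Smith Jr|John Smith)')

# Factored: at each choice point the alternatives begin with
# different characters, so a failed branch dies after one comparison.
factored = re.compile('^John Smith(?: (?:Sr|Jr)|)')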

In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)

> "Dozens to hundreds" sounds like a fairly small number in context,

I agree, though hundreds would make the longest RE string I've
ever used.

> but: time it! Only you know what fast enough is.

It seems fast. Testing doesn't allay my fears of a bad worst-
case that an adversary might discover.

> Here's a cute approach that should have running time independent
> of the number of triggers:
[snip]

I've seen that kind of structure called a "trie" (Knuth credits
the name 'trie' to E. Fredkin, CACM 3, 1960). When I generate
my RE, I sort of use the same trie structure. My (trivial)
tests so far indicate that the RE form is faster in typical
cases. Your trie-searcher is nice in that I have much more
confidence in the worst-case run-time.


> Here's a less cute but probably saner approach: [...]
> this should run in time proportional to the number of different
> lengths of trigger (and use rather less memory than my first version).

I'm not convinced. The time to hash a string is proportional to
its length, so I think searches can take time proportional to
the sum of the distinct trigger lengths.


[I, Bryan asked:]
> > Am I likely to exceed recursion depth?
>
> Not in 2.4 :) And I doubt it before that, actually.

But Python is only up to ... oh, the smiley. I get it.

> This thread from a year or so back might interest:
>
> http://groups.google.com/[email protected]

Thanks,
 

Michael Hudson

Bryan Olson said:
> Thanks Michael. I still have more investigating to do.
>
> I guess I didn't point this out except for my example: I tried
> to make the RE efficient even for a search-and-backtrack engine.
> Not only are there no loop-like things, there are never multiple
> choices that begin with the same character. Any time it
> backtracks, it will fail immediately at every choice point.

Hmm, you sound like you know more about this sort of thing than I do
:)

> In my example I compile the RE:
>
> '^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'
>
> If the RE compiler is smart, searching should involve scanning
> the longest prefix that could still match, plus looking at just
> one character for each branch-point along the way. (If it uses
> a jump-table at decision points, it should be truly linear.)

Well, I don't know much about the re compiler. It's written in
Python, though, so it shouldn't be *that* hard to check for
yourself...
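
If I remember right, sre has a DEBUG flag that dumps the parsed
pattern, so something like this shows what the compiler makes of it:

import re

# Compiling with re.DEBUG prints the parse tree to stdout.
re.compile('^(?:Mary|John( Smith( (Sr|Jr)|)|)|Bob)', re.DEBUG)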
"Dozens to hundreds" sounds like a fairly small number in context,

I agree, though hundreds would make the longest RE string I've
ever used.
but: time it! Only you know what fast enough is.

It seems fast. Testing doesn't allay my fears of a bad worst-
case that an adversary might discover.
Here's a cute approach that should have running time independent
of the number of triggers:
[snip]

I've seen that kind of structure called a "trie" (Knuth cites
the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
my RE, I sort of use the same trie structure.

Heh, I'd sort of heard of tries but didn't realize I was implementing
one. I just thought it was a simple minded DFA engine...

> My (trivial) tests so far indicate that the RE form is faster in
> typical cases. Your trie-searcher is nice in that I have much more
> confidence in the worst-case run-time.

Well, the re engine is written in C, so you would hope that it beats
my Python engine...

The reason I originally wrote code somewhat like what I put in my first
post is that I needed character-at-a-time processing.

> > Here's a less cute but probably saner approach: [...]
> > this should run in time proportional to the number of different
> > lengths of trigger (and use rather less memory than my first version).
>
> I'm not convinced. The time to hash a string is proportional to
> its length, so I think searches can take time proportional to
> the sum of the distinct trigger lengths.

I guess, but the string hashing is done in C and is "probably"
insignificant.

> [I, Bryan asked:]
> > > Am I likely to exceed recursion depth?
> >
> > Not in 2.4 :) And I doubt it before that, actually.
>
> But Python is only up to ... oh, the smiley. I get it.

Well, my point was that there is no recursion limit in Python 2.4's sre
engine. But from my limited understanding, I don't think this type of
regular expression causes any version of sre to recurse.

Cheers,
mwh
 
Pierre-Frédéric Caillaud

I think you need some binary search or some form of tree here.

Make a list of prefixes, sort them, and store the list somewhere
that's quick to access.
It is then very quick (O(log N)) to find which key in that list
corresponds to the beginning of a path.
Basically you do a bisection search; of course you won't get an
exact match, but the last key you end up on is your prefix.

I believe there is a btree data type in the Python library.
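
Something like this with the standard bisect module (just a sketch;
bisect_prefix is a made-up name):

import bisect

triggers = sorted(["John", "Johnson", "Jones"])

def bisect_prefix(path):
    # Find where the path would land in the sorted trigger list,
    # then look at the key just before that point.
    i = bisect.bisect_right(triggers, path)
    if i and path.startswith(triggers[i - 1]):
        return triggers[i - 1]
    return None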
 

Gustavo Niemeyer

Hello Bryan,

> Is this likely to be fast in normal cases? Is there a
> clearly better approach?

Yes, it should be fast, since no repetitive backtracking will happen.

> What would be the worst-case time?

Given the context (person names), your worst case will probably be
failing every name.

> Am I likely to exceed recursion depth?

You won't, unless you find a name with thousands of words, and
variants for every word. With Python 2.4, you'll only get in
trouble if you exceed your system's memory.

> Any other gotcha? Am I likely to find it when I compile the
> RE, or when I try to match a path?

It looks like a sane use of regular expressions.
 

Gustavo Niemeyer

[...]
> In my example I compile the RE:
>
> '^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'
>
> If the RE compiler is smart, searching should involve scanning
> the longest prefix that could still match, plus looking at just

This won't happen right now. Minimum and maximum length are currently
only checked for the whole pattern, not for every branch.

> one character for each branch-point along the way. (If it uses
> a jump-table at decision points, it should be truly linear.)

OTOH, there's some intelligence for this specific case. The first
character is checked even before "entering" the branch (IOW,
preparing the environment for checking the branch), so your
pattern should match really fast.

It will be something close to:

(1) Entering a branch, save current state as necessary.
(2) Check first character, does it match?
(3) If yes, get into the branch, saving state as necessary, and
checking other characters, subpatterns, etc.
(4) No, jump to next alternative, go back to (2).
(5) Yes, leave the current branch with success, restoring states
as necessary.
 

Bryan Olson

Pierre-Frédéric Caillaud said:
>
> I think you need some binary search or some form of tree here.
>
> Make a list of prefixes, sort them, and store the list somewhere
> that's quick to access.
> It is then very quick (O(log N)) to find which key in that list
> corresponds to the beginning of a path.
> Basically you do a bisection search; of course you won't get an
> exact match, but the last key you end up on is your prefix.

That doesn't quite work for my problem. Consider the list:

'John',
'Johnson',
'Jones'

A search for the string 'Johnston' (note the 't') will locate
the point between 'Johnson' and 'Jones'. 'Johnson' shares a
long prefix with 'Johnston', but I want the longest string in
the list that *is* a prefix of 'Johnston', and in this case that
is 'John'.
 

Bryan Olson

Thanks for the responses, Gustavo. That's most of what I need to
know.

Gustavo Niemeyer wrote:
[...]
> Given the context (person names), your worst case will probably be
> failing every name.

The names were just for demo. The actual application is to
match resource paths with the most-specific handler.

[...]
> It looks like a sane use of regular expressions.

Since it's not obviously insane, I'll include the code below,
in case anyone else wants to use it or QA it for free.


--Bryan


"""
The Problem: We're given a set of "start strings". Later
we'll be called (many times) with a "path string", and we
need to find the longest of the start strings that is a
prefix of the given path string.

Method: Build a Python regular expression (RE) from the
start strings, such that searching a path with the compiled
RE will match the desired prefix. The RE has no loop-like
things, and never has two branches beginning with the same
character. Searching should be fast in all cases.

"""

import re


def build_prefix_re(str_list):
    """ Given a sequence of strings, return a compiled RE such
        that build_prefix_re(str_list).search(x) will match the
        longest string in str_list that is a prefix of x.
    """

    def trie_insert(map, str):
        # Insert str into the trie of nested dicts; a "" key
        # marks the end of a start string.
        if str:
            submap = map.setdefault(str[0], {})
            trie_insert(submap, str[1:])
        else:
            map[""] = {}

    def sequentialize(map, lst):
        # Emit the trie as RE source, appending fragments to lst.
        if map:
            keys = map.keys()
            # Order so that longer matches are first
            keys.sort()
            keys.reverse()
            lst.append('(?:')
            separator = ''
            for key in keys:
                lst.append(separator + re.escape(key))
                submap = map[key]
                while len(submap) == 1:
                    # While follow-set is singleton, condense.
                    key = submap.keys()[0]
                    submap = submap[key]
                    lst.append(re.escape(key))
                sequentialize(submap, lst)
                separator = '|'
            lst.append(')')

    map = {}
    for str in str_list:
        trie_insert(map, str)
    lst = ['^']
    sequentialize(map, lst)
    re_str = "".join(lst)
    # print "Prefix finding RE is: '%s'\n" % re_str
    return re.compile(re_str)


if __name__ == '__main__':
    slist = ["a.b/cde/fg", "bchij", "bivwd", "cdj", "cdjwv", "cdjxi"]
    prematcher = build_prefix_re(slist)
    for str in slist:
        match = prematcher.search(str)
        assert match.group(0) == str
        s = str + 'extraneous'
        match = prematcher.search(s)
        assert match.group(0) == str
    for str in ('', 'cd', 'bb', 'cdi', 'xbchij', 'bchiij'):
        match = prematcher.search(str)
        assert not match, "BAD, found '%s'" % str
 
