[Bryan Olson]
[Tim Peters]
[Bryan]
I know I'm disagreeing with a lot of smart people in panning
the book.
That's allowed
I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.
My guess then is that the book moved at too slow a pace for you, and
you got bored. The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:
normal* (?: special normal* )*
where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.
Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.
For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal: double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):
" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "
The "normal case" is that it's just a regular old character: not a
double quote, backslash, or newline. The "special case" is a
backslash followed by any character whatsoever. The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end) fits the
"unrolling" pattern to a tee.
As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).
A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense". It's a set of obscure & specialized learned skills.