How to get the "longest possible" match with Python's RE module?

Fredrik Lundh · Sep 12, 2006

Licheng Fang wrote:

The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the sytactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution

what's hiding under those ellipses? things like "TERM" and "FACTOR" and
"EXPRESSION", perhaps?

</F>

Frederic Rentsch · Sep 12, 2006

Licheng said:
Basically, the problem is this:

'do'

Python's NFA regexp engine trys only the first option, and happily
rests on that. There's another example:

'oneself'

The Python regular expression engine doesn't exaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

Licheng,

If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.

>>> targets = ['be', 'bee', 'been', 'being']
>>> targets.sort ()
>>> targets.reverse ()
>>> regex = '|'.join (targets)
>>> re.findall (regex, 'Having been a bee in a former life, I don\'t

Click to expand...

Click to expand...

mind being what I am and wouldn\'t want to be a bee ever again.')
['been', 'bee', 'being', 'be', 'bee']

You might also take a look at a stream editor I recently came out with:
http://cheeseshop.python.org/pypi/SE/2.2 beta

It has been well received, especially by newbies, I believe because it
is so simple to use and allows very compact coding.
being what I am and wouldn\'t want to be a bee ever again.')

"Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."

Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.

'been,bee,being,be,bee,'

You can do extraction filters, deletion filters, substitutitons in any combination. It does multiple passes and can takes files as input, instead of strings and can output files.
"*INT=int substitute"
"*DECIMAL=decimal substitute"
"*FACTION=faction substitute"
"*NUMERALS=numerals substitute"
# ... etc.
''')

I don't know if that could serve.

Regards

Frederic

Bryan Olson · Sep 12, 2006

Tim said:
[Bryan Olson]

Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book

Click to expand...

Strongly disagree: [...]

I know I'm disagreeing with a lot of smart people in panning
the book.

Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.

I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.

Bryan Olson · Sep 12, 2006

Licheng said:
Basically, the problem is this:

'do'

Python's NFA regexp engine trys only the first option, and happily
rests on that. There's another example:

'oneself'

The Python regular expression engine doesn't exaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

Yes. Here's a way, but it sucks real bad:

def longest_match(re_string, text):
regexp = re.compile('(?:' + re_string + ')$')
while text:
m = regexp.match(text)
if m:
return m
text = text[:-1]
return None

Tim Peters · Sep 12, 2006

[Licheng Fang]

....

[Bryan Olson]

Yes. Here's a way, but it sucks real bad:

def longest_match(re_string, text):
regexp = re.compile('(?:' + re_string + ')$')
while text:
m = regexp.match(text)
if m:
return m
text = text[:-1]
return None

If you want to try something like that, note that the match() method
accepts optional slice arguments, so the "text = text[:-1]" business
can be replaced with much quicker little-integer arithmetic.

Tim Peters · Sep 12, 2006

[Bryan Olson]
[Tim Peters]

Strongly disagree: [...]

Click to expand...

[Bryan]
I know I'm disagreeing with a lot of smart people in panning
the book.

That's allowed

I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.

My guess then is that the book moved at too slow a pace for you, and
you got bored. The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.

Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.

For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal: double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):

" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character: not a
double quote, backslash, or newline. The "special case" is a
backslash followed by any character whatsoever. The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end) fits the
"unrolling" pattern to a tee.

As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).

The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).

A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense". It's a set of obscure & specialized learned skills.

Bryan Olson · Sep 13, 2006

Tim said:
[...] The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.
[...]
As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).

So how general is that? The books just says "certain common
expressions". I think the real answer is that one can unroll
exactly the true regular expressions. The unrolling is
essentially resolving the states of a DFA by hand.

Bryan Olson · Sep 13, 2006

Licheng said:
Basically, the problem is this:

'do'

Python's NFA regexp engine trys only the first option, and happily
rests on that. There's another example:

'oneself'

The Python regular expression engine doesn't exaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

A reply from Tim Peters reminded me that I once programmed a
simple special case of this problem. In particular, given a
lot of strings, my function returns a regexp that efficiently
matches the longest of prefix equal to any of the strings.
If the application is like the examples, where the target RE
is just the or of strings, the method should work well.

http://groups.google.com/group/comp.lang.python/msg/a22eb274845653e9

Now I'm thinking of a more general version, that handles all
true regular expressions.

gatti · Sep 13, 2006

Frederic Rentsch wrote:

If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.

Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Frederic Rentsch · Sep 13, 2006

Frederic Rentsch wrote:

Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Very true! Funny you should remind me, considering that I spent quite
some time upgrading SE to allow regular expressions. Version 1 didn't
and could resolve precedence at compile time. Version 2 resolves
precedence at runtime by length of the matches and should function
correctly in this respect, although it might not function fast enough
for speed-critical applications. But then there is in general a
trade-off between convenience and speed.

Frederic

Frederic Rentsch · Sep 13, 2006

Frederic Rentsch wrote:

Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Oh yes, and my proposal to reverse-sort the targets was in response to
the OP who wanted to generate a regex from a bunch of strings. It should
work for that.

Frederic

Licheng Fang · Sep 14, 2006

Thank you guys. I've written a CYK parser and realized this is the
right direction. It gives every possible interpretation of the string
and I can retrieve whatever I want.

Tim Peters · Sep 16, 2006

[Tim Peters]

[...] The most valuable general technique [Friedl] (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.

Click to expand...

[Bryan Olson]

So how general is that?

Between 6 and 7 ;-)

The books just says "certain common expressions". I think the real
answer is that one can unroll exactly the true regular expressions.

I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).'

The unrolling is essentially resolving the states of a DFA by hand.

Oh yes, for those who've studied the theoretical underpinnings. Most
people have not. Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works. And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.

That explanation may not work for you, but it's effective for people
who haven't studied the theory here. Such explanations also extend
naturally to gimmicks like backreferences, which have no counterpart
in CompSci regexps.

Python's re module and genealogy problem	10	Jun 11, 2014
Hello all! Noob here with completely unrealistic ambitions. Happy to join the crew and get good enough to help others.	4	Aug 13, 2024
[SUMMARY] Longest Repeated Substring (#153)	0	Jan 24, 2008
How to get education and coding job coming from abroad starting new in the US? Advice of courses or places to look?	2	May 18, 2023
Must be a bug in the re module [was: Why this result with the remodule]	0	Nov 2, 2010
how to match u'\uff00' - u'\uff0f' in re module?	3	Jul 10, 2006
Working with named groups in re module	2	Jan 10, 2007
FAQ 6.12 Can I use Perl regular expressions to match balanced text?	0	Jan 9, 2011

How to get the "longest possible" match with Python's RE module?

Fredrik Lundh

Frederic Rentsch

Bryan Olson

Bryan Olson

Tim Peters

Tim Peters

Bryan Olson

Bryan Olson

gatti

Frederic Rentsch

Frederic Rentsch

Licheng Fang

Tim Peters

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads