How to get the "longest possible" match with Python's RE module?

Licheng Fang · Sep 12, 2006

Basically, the problem is this:
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?
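
A minimal reproduction of the two results quoted above, assuming the patterns were `do|dolittle` and `one(self)?(selfsufficient)?` (the ones tested later in this thread):

```python
import re

# Alternation: the first alternative that succeeds wins, even though a
# longer alternative would also match at the same position.
print(re.match(r"do|dolittle", "dolittle").group())  # do

# Optional groups are tried greedily, left to right; once (self)? has
# consumed "self", (selfsufficient)? can only match the empty string.
print(re.match(r"one(self)?(selfsufficient)?", "oneselfsufficient").group())  # oneself
```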

MonkeeSage · Sep 12, 2006

Licheng said:
Basically, the problem is this:

From what I understand, this isn't python specific, it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.

There's another example:

'oneself'

Again, I don't think this has anything to do with python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".

I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.

Regards,
Jordan
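
One way to act on the "shorter subexpression first" observation, sketched under the assumption that the alternatives are plain literal strings and nothing follows the alternation: sort them longest first before joining them with `|`. The helper name `longest_first` is made up for this illustration.

```python
import re

def longest_first(literals):
    """Build an alternation of escaped literals, longest alternative first."""
    ordered = sorted(literals, key=len, reverse=True)
    return re.compile("|".join(re.escape(word) for word in ordered))

pattern = longest_first(["do", "dolittle"])
print(pattern.match("dolittle").group())  # dolittle
print(pattern.match("dodad").group())     # do
```

This does not help once the alternatives are themselves patterns of variable length, because their order then no longer predicts which one matches more text.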

kondal · Sep 12, 2006

Licheng said:
Basically, the problem is this:

'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:

'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

This is how regexps work; Python doesn't have anything to do with
it. The engine starts parsing the data with the given pattern, returns
the matched string according to the pattern, and doesn't go back to
find the other combinations.

So to get all the combinations you would probably need to supply a
different pattern each time.
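
A rough sketch of the "different pattern each time" idea: try every alternative as its own pattern at the same starting position and keep whichever match ends furthest to the right. The function name `longest_match` is an assumption for this example, not something from the re module.

```python
import re

def longest_match(patterns, text, pos=0):
    """Return the longest match among `patterns` anchored at `pos`, or None."""
    best = None
    for pattern in patterns:
        m = re.compile(pattern).match(text, pos)
        # Keep the candidate only if it ends later than the best so far.
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

m = longest_match(["do", "dolittle"], "dolittle")
print(m.group())  # dolittle
```

This only covers alternation at the top level; choices nested inside a single pattern are still resolved by the engine's usual first-success rule.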

Licheng Fang · Sep 12, 2006

MonkeeSage said:
From what I understand, this isn't python specific, it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.

Again, I don't think this has anything to do with python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".

I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.

Regards,
Jordan

Hi, according to these regexp engine discussions, it's NOT behavior
common to every implementation.

http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html

Python's NFA engine reads along the input string, matching it to the
pattern, and backtracking when needed. By contrast a DFA engine, to my
understanding, constructs a DFA and uses it to munch as many characters
as possible. Maybe it's like this:

Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

         one         self, none         selfsufficient, none
(start)------->((1))------------>((2))----------------------->((3))

DFA ENGINE:

         one             self
(start)------->((123))------------>((23))
                  |
                  |
                  | selfsufficient
                  --------------->((3))

I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?

Paddy · Sep 12, 2006

I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?

Maybe if you posted a (tested) grep example and data, that does as you
want, the group could better understand what you are asking for?

- Paddy.

Tim Peters · Sep 12, 2006

[Licheng Fang]

...
I want to know if there is some way to make Python RE behave like grep
does,

Not in general, no. The matching strategies couldn't be more
different, and that's both deep and intentional. See Friedl's book
for details:

http://regex.info/

or do I have to change to another engine?

Yes, if POSIX regexp semantics are what you require. Several years
ago there was an extension module for Python supplying POSIX
semantics, but I couldn't find anything current just now in a minute
of searching. You may be more motivated to search longer ;-)
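
Short of switching engines, one brute-force way to get a longest match out of re for a single pattern is to insist on a full match against ever shorter slices of the input. This is only a sketch: it assumes Python 3.4 or later (where compiled patterns have `fullmatch` with `pos` and `endpos`), the name `longest_from` is invented here, and it costs one match attempt per candidate end position.

```python
import re

def longest_from(pattern, text, pos=0):
    """Return the longest match of `pattern` starting exactly at `pos`, or None."""
    compiled = re.compile(pattern)
    # Try the longest candidate end first; fullmatch forces the engine to
    # backtrack through every alternative before giving up on that end.
    for end in range(len(text), pos - 1, -1):
        m = compiled.fullmatch(text, pos, end)
        if m is not None:
            return m
    return None

print(longest_from(r"do|dolittle", "dolittle").group())  # dolittle
print(longest_from(r"one(self)?(selfsufficient)?", "oneselfsufficient").group())  # oneselfsufficient
```

Because `endpos` makes the pattern behave as if the string ended there, patterns using `$` or lookahead may give different answers than they would under a true POSIX engine.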

MonkeeSage · Sep 12, 2006

Licheng said:
Hi, according to these regexp engine discussions, it's NOT a behavior
true to any implementation.
[snip]

Well, I just double-checked in ruby (oniguruma regexp engine):

r = Regexp.new("do|dolittle")
puts r.match("dolittle")[0]
# do

r = Regexp.new("one(self)?(selfsufficient)?")
puts r.match("oneselfsufficient")[0]
# oneself

And perl:

if ("dolittle" =~
    /(do|dolittle)/) {
    print "$1\n";
    # do
}

if ("oneselfsufficient" =~
    /(one(self)?(selfsufficient)?)/) {
    print "$1\n";
    # oneself
}

And Javascript (whatever regexp engine Spidermonkey uses):

var r = new RegExp(/do|dolittle/);
alert("dolittle".match(r)[0]);

var r = new RegExp(/one(self)?(selfsufficient)?/);
alert("oneselfsufficient".match(r)[0]);

So, it seems they are all broken, or python is correct as well.

Regards,
Jordan

MonkeeSage · Sep 12, 2006

MonkeeSage said:
So, it seems they are all broken, or python is correct as well.

Aha, sorry about that Licheng (re: Tim's post). I guess "broken"
depends on if you are expecting perl-compatible behavior or otherwise.
I have my own scripts I use to do (f)grep and sed-like operations, so I
almost never use those programs and forgot about the different pattern
semantics (part of the reason I made my own scripts).

Regards,
Jordan

Licheng Fang · Sep 12, 2006

Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.

MonkeeSage · Sep 12, 2006

Licheng said:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.

Sorry! *blush* I admit I skipped over your links. I'll have a look now.

BTW, just an idea that may or may not work. What about finding all
matches that meet the absolute baseline pattern, then taking the
longest of them...something like this maybe:

import re

def matcher(strings, pattern):
    out = ''
    reg = re.compile(pattern)
    for string in strings:
        found = reg.match(string)
        if found is None:  # skip strings the pattern does not match at all
            continue
        match = found.group()
        # >= keeps the last of several equally long matches; > would keep the first
        if len(match) >= len(out):
            out = match
    return out  # empty if no matches, else longest match

p = ['dodad', 'dolittle', 'dodaday']
print(matcher(p, r'do.*'))
# dolittle

Just a thought...

Regards,
Jordan

Licheng Fang · Sep 12, 2006

Thank you very much, Tim and Monkee.

In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
.... ...

What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and in the course of
this I encountered the problem with the semantics of '|'.

My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?
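
For readers who want to picture the substitution script, here is a small sketch of how it might look. It is a guess at the approach, not the actual script: the dictionary layout and the `expand` helper are assumptions, and the names are spelled as in the post. Each expanded part is wrapped in a non-capturing group so that a `|` inside one definition cannot leak into the surrounding pattern.

```python
import re

# Definitions in the "(*NAME)" placeholder style described above.
DEFINITIONS = {
    "INT": r"\d+",
    "DECIMAL": r"(*INT)\.(*INT)",
    "FACTION": r"(*DECIMAL)/(*DECIMAL)",
    "NUMERALS": r"(*FACTION)|(*DECIMAL)|(*INT)",
}

PLACEHOLDER = re.compile(r"\(\*([A-Z]+)\)")

def expand(name, definitions=DEFINITIONS):
    """Recursively replace each (*NAME) with its definition.

    No cycle detection: a definition that refers to itself would recurse
    until Python raises RecursionError.
    """
    def substitute(match):
        return "(?:" + expand(match.group(1), definitions) + ")"
    return PLACEHOLDER.sub(substitute, definitions[name])

numerals = re.compile(expand("NUMERALS"))
print(numerals.match("3.14/2.5 apples").group())  # 3.14/2.5
print(numerals.match("3.14 apples").group())      # 3.14
print(numerals.match("42 apples").group())        # 42
```

The results come out as hoped here only because NUMERALS lists its alternatives from most to least specific; reorder them as `(*INT)|(*DECIMAL)|(*FACTION)` and every input above yields just the leading integer.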

MonkeeSage · Sep 12, 2006

Or maybe something like this is better:

import re

def matcher(string, pattern):
    out = ''
    # \S* on both sides widens each hit to the whole whitespace-delimited word
    for match in re.findall(r'\S*%s\S*' % pattern, string):
        if len(match) >= len(out):
            out = match
    return out

p1 = 'dodad donkeykong dolittle dodaday'
p2 = 'oneself self-serving selfsufficient oneselfsufficient'
print(matcher(p1, 'do'))
# donkeykong
print(matcher(p2, 'self'))
# oneselfsufficient

Bryan Olson · Sep 12, 2006

Licheng said:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.

Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.

"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.

What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.

Tim Peters · Sep 12, 2006

[Licheng Fang]

...
In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and in the course of
this I encountered the problem with the semantics of '|'.

My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?

My best advice is to find a different approach entirely. For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.

Regexps are a brittle tool, best tolerated in small doses. For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with). OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).

Just sounds to me like the wrong tool for the job.
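
To make the "regexps only for gross tokenization, grammar for the rest" advice concrete, here is a toy sketch for the numerals example quoted above. The token names, the two parse functions and the hand-written grammar are all illustrative assumptions; a real project would more likely reach for a parser library.

```python
import re

# The regexp only splits the input into coarse tokens.
TOKEN = re.compile(r"(?P<INT>\d+)|(?P<DOT>\.)|(?P<SLASH>/)")

def tokenize(text, pos=0):
    """Return (kind, text) pairs until the first character that is not a token."""
    tokens = []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            break
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_decimal(tokens, i):
    """DECIMAL := INT DOT INT. Return the index just past it, or None."""
    kinds = [kind for kind, _ in tokens[i:i + 3]]
    return i + 3 if kinds == ["INT", "DOT", "INT"] else None

def parse_numeral(tokens, i=0):
    """NUMERAL := DECIMAL SLASH DECIMAL | DECIMAL | INT, preferring the longest."""
    end = parse_decimal(tokens, i)
    if end is not None:
        if end < len(tokens) and tokens[end][0] == "SLASH":
            after = parse_decimal(tokens, end + 1)
            if after is not None:
                end = after  # the fraction form extends the decimal
        return "".join(text for _, text in tokens[i:end])
    if i < len(tokens) and tokens[i][0] == "INT":
        return tokens[i][1]
    return None

print(parse_numeral(tokenize("3.14/2.5")))  # 3.14/2.5
print(parse_numeral(tokenize("3.14")))      # 3.14
print(parse_numeral(tokenize("42")))        # 42
```

Here "longest wins" is an explicit decision in `parse_numeral` rather than a side effect of how alternatives happen to be ordered.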

Tim Peters · Sep 12, 2006

[Licheng Fang]
[Bryan Olson]

Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book

Strongly disagree: it's an excellent book about the /pragmatics/ of
using "regular expressions" as most widely implemented. It's not at
all about "regular expressions" in the CompSci sense of the term,
which appears to be your complaint.

was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.

As far as I could tell at the time, he originated it. I'm not happy
about that either.

"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.

And which has very little to do with "regular expressions" as most
widely implemented -- gimmicks like backreferences are wholly outside
the DFA formalism.

What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.

Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.

gatti · Sep 12, 2006

kondal said:
This is how regexps work; Python doesn't have anything to do with
it. The engine starts parsing the data with the given pattern, returns
the matched string according to the pattern, and doesn't go back to
find the other combinations.

I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.

Lorenzo Gatti
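
The dynamic programming parser itself is not shown in the thread. As a stand-in, the following brute-force sketch still uses re (so it is a different technique) to compute the same kind of result, the set of all match lengths from a starting point, for the pattern quoted above. It assumes Python 3.4+ for `fullmatch`, and `all_match_lengths` is a hypothetical name.

```python
import re

def all_match_lengths(pattern, text, pos=0):
    """List every length n such that text[pos:pos + n] matches `pattern` in full."""
    compiled = re.compile(pattern)
    return [end - pos
            for end in range(pos, len(text) + 1)
            if compiled.fullmatch(text, pos, end) is not None]

pattern = r"([A-Z][a-z])?([A-Z][A-Za-z]{1,10})?"

# The ordinary greedy match stops after the first optional group...
print(re.match(pattern, "Abcdefg").group())        # Ab
# ...although the whole word is among the possible matches.
print(all_match_lengths(pattern, "Abcdefg"))       # [0, 2, 3, 4, 5, 6, 7]
```

Taking `max()` of the returned list gives the longest match length at that position.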

Licheng Fang · Sep 12, 2006

Bryan said:
Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.

"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.

What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.

Thanks for the valuable information. Indeed, when I read the pages, I
was a little confused about what it meant by 'NFA'.

But I faintly felt there might be an indirect, and not very exact,
mapping from the search-and-backtrack strategy to NFAs in the sense of
computer science, e.g. a state machine with the capability to be in
several states at one time.

Say, when reading 'oneselfsufficient', the engine goes along the NFA
first to state 1, and then faces the choice between

         one         self, none         selfsufficient, none
(start)------->((1))------------>((2))----------------------->((3))

1) matching 'self',
2) going on to state 2 without matching anything, or
3) just giving 'one' as the matching result because state 1 is already a
terminal state

In such situations it always chooses the greedy way. To match more, it
goes to the state 2, munching 'self'. And now it's left with only
'sufficient'. Here its choices are:

1) matching nothing and going to state 3
2) just give 'oneself' as result because state 2 is also a terminal
state

Again it's greedy, going on to state 3, in hope of matching more. But
here the pattern comes to an end, represented by state 3 as a terminal
state. So the engine gives 'oneself' as result and forgets about its
previously unexplored possibilities, because it only performs backtrack
when a match cannot be found.

I think if the backtracking is carried out in an exhaustive way, we may
say the engine tries every possibility on the NFA, though it's not an NFA
itself.
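
The walk-through above can be turned into a toy that does carry the backtracking out exhaustively. It is not how re works internally; it is a hand-written sketch for one required literal followed by optional literals, and `all_match_ends` is a name invented for the example.

```python
def all_match_ends(text, required, optionals, pos=0):
    """Return every end position reachable by matching `required` and then
    each literal in `optionals`, in order, either taking or skipping it."""
    if not text.startswith(required, pos):
        return set()
    ends = set()

    def explore(index, position):
        if index == len(optionals):
            ends.add(position)  # reached the final state: record this end
            return
        literal = optionals[index]
        # Branch 1 (the greedy choice): consume the optional literal if present.
        if text.startswith(literal, position):
            explore(index + 1, position + len(literal))
        # Branch 2: skip it. A first-success engine never comes back here
        # once branch 1 has led to an overall match.
        explore(index + 1, position)

    explore(0, pos + len(required))
    return ends

ends = all_match_ends("oneselfsufficient", "one", ["self", "selfsufficient"])
print(sorted(ends))  # [3, 7, 17]
```

The three end positions correspond to 'one', 'oneself' and 'oneselfsufficient'; the engine described above reports the first one it completes (7), while a longest-match engine would report 17.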

Licheng Fang · Sep 12, 2006

gatti said:
I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.

Thanks. I think making use of the expressiveness of a CFG may be a
better idea.

Another question: my task is to find in a given string the substrings
that satisfy a particular pattern. That's why the first tool that
came to my mind was regular expressions. Parsers, however, only give a
yes/no answer to a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.

How can I solve this problem?

Paul Rubin · Sep 12, 2006

Licheng Fang said:
I think if the backtracking is carried out in an exhaustive way, we may
say the engine tries every possibility on the NFA, though it's not an NFA
itself.

The backtracking engine really can recognize languages that are not
describable by classical regexps, by using backreferences, negation,
etc. But exactly what it can do is nowhere near as well-understood as
what classical regexps can do.

I seem to remember that the fully general problem of recognizing
regexps with negation is very hard, so the backtracking matcher either
has to reject some strings it should really match, or else it has to
bog down horribly with certain kinds of patterns.
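
A small illustration of the backreference point: the language "some a's, a b, then exactly as many a's again" is not regular, so no classical regexp or DFA recognizes it, yet Python's backtracking engine handles it with `\1`. The variable name is arbitrary, and `fullmatch` assumes Python 3.4+.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of a's
# are forced to have the same length.
same_on_both_sides = re.compile(r"(a*)b\1")

print(bool(same_on_both_sides.fullmatch("aabaa")))   # True
print(bool(same_on_both_sides.fullmatch("aabaaa")))  # False
```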

gatti · Sep 12, 2006

Licheng said:
Another question: my task is to find in a given string the substrings
that satisfy a particular pattern. That's why the first tool that
came to my mind was regular expressions. Parsers, however, only give a
yes/no answer to a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.

You can collect all successful parser results beginning from each index
in the string; this gives you all matches with that first index.
You could extend general bottom-up context-free parsing algorithms,
like Earley's or Tomita's, to return multiple results; for reasonable
languages most locations can be excluded for most rules at the
beginning, with great performance improvements over trying over and
over again.

Lorenzo Gatti
