leftmost longest match (of disjunctions)

J

Joerg Schuster

Hello,

The program given below returns the lines:

a
ab

Is there a way to use python regular expressions such that the program
would return the following lines?

ab
ab

########################################################################

import re

rx1 = re.compile("(a|ab)")
rx2 = re.compile("(ab|a)")

out1 = rx1.search("ab")
out2 = rx2.search("ab")

print out1.group(1)
print out2.group(1)

########################################################################

Jörg
 
P

Peter Hansen

Joerg said:
The program given below returns the lines:

a
ab

Is there a way to use python regular expressions such that the program
would return the following lines?

ab
ab

########################################################################

import re

rx1 = re.compile("(a|ab)")
rx2 = re.compile("(ab|a)")

Have you checked the documentation for "re"?

It reads:
"|" A|B, where A and B can be arbitrary REs, creates a regular expression
that will match either A or B. An arbitrary number of REs can be separated
by the "|" in this way. This can be used inside groups (see below) as well.
As the target string is scanned, REs separated by "|" are tried from left to
right. When one pattern completely matches, that branch is accepted. This
means that once A matches, B will not be tested further, even if it would
produce a longer overall match. In other words, the "|" operator is never
greedy.

------

Seems pretty clear and explicit to me. Your example is basically a working
proof of the above code, so I'm not sure what you were expecting differently.

-Peter
 
J

Joerg Schuster

Peter Hansen said:
produce a longer overall match. In other words, the "|" operator is never
greedy.

O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


Jörg
 
J

Joerg Schuster

Peter Hansen said:
produce a longer overall match. In other words, the "|" operator is never
greedy.

O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?


Jörg
 
P

Padraig

Joerg said:
O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?

sort the re by size first?

Pádraig.
 
J

Joerg Schuster

sort the re by size first?

The point is not to get the match of the longest part of the
disjunction, but to get the match of that part of the disjunction
which is the longest one. (The match of ".*" may be much longer than
the match of "abc", although the latter regex contains more
characters.)

Jörg
 
J

Joerg Schuster

Joerg Schuster said:
The point is not to get the match of the longest part of the
disjunction, but to get the match of that part of the disjunction
which is the longest one. (The match of ".*" may be much longer than
the match of "abc", although the latter regex contains more
characters.)

Jörg

I should have read your mail more carefully. You wrote about sorting
the *re*, not about sorting the regex. Sorry.

Jörg
 
F

Fredrik Lundh

Joerg said:
The point is not to get the match of the longest part of the
disjunction, but to get the match of that part of the disjunction
which is the longest one. (The match of ".*" may be much longer
than the match of "abc", although the latter regex contains more
characters.)

you can use "sre_parse.parse(x).getwidth()" on a subexpression, to
get the shortest/longest possible match.
(1, 65535)

(where >=65535 should be interpreted as "any number")

</F>
 
F

Fredrik Lundh

Joerg said:
I should have read your mail more carefully. You wrote about sorting
the *re*, not about sorting the regex. Sorry.

what's the difference?

</F>
 
D

Diez B. Roggisch

Joerg said:
O.k. Thanks for pointing this out. Maybe I should have formulated my
question differently: Is there a trick (be it dirty or not) to make
"|" greedy in python?

I don't think so - as regexps are finite state automata, their capabilities
are limited to what these machines can do. So they can't backtrack to a
part of the disjunction that was a shorter match if the longer match
suddenly will fail. Consider this example:

rex: a|abc

string: abd

If the state-machine was greedy, it would follow the second path and fail.

The only thing you can do to avoid is is to split your disjunctions and
search separately, and afterwards takt the longest part. That could be done
using a small grammar, that allows you to detect the disjunctions and
create separate rexes for it.

Diez
 
J

Joerg Schuster

Fredrik Lundh said:
what's the difference?


One could suspect that re.search() first matches with a greedy "|", then
stores all the matches of all parts of the disjunction in the
re-object and then returns the first one.

Jörg
 
J

Joerg Schuster

Fredrik Lundh said:
you can use "sre_parse.parse(x).getwidth()" on a subexpression, to
get the shortest/longest possible match.

(1, 65535)

(where >=65535 should be interpreted as "any number")

</F>

Thanks for the tip.
 
J

Joerg Schuster

The only thing you can do to avoid is is to split your disjunctions and
search separately, and afterwards takt the longest part. That could be done
using a small grammar, that allows you to detect the disjunctions and
create separate rexes for it.

Diez

Thanks for the tip.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,045
Latest member
DRCM

Latest Threads

Top