"Commutative" regexps?

J

J Krugman

Suppose that I have three different regexps, for example:

my $r1 = qr/foo/;
my $r2 = qr/bar/;
my $r3 = qr/baz/;

(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:

for my $doc (@docs) {
if (/$r1/s && /$r2/s && /$r3/s) {
go_to_town($_);
}
}

But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions

/$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).

Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?

Any suggestions on this would be much appreciated.

jill
 
A

Anno Siegel

J Krugman said:
Suppose that I have three different regexps, for example:

my $r1 = qr/foo/;
my $r2 = qr/bar/;
my $r3 = qr/baz/;

(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:

for my $doc (@docs) {
if (/$r1/s && /$r2/s && /$r3/s) {
go_to_town($_);
}
}

But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions

/$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).

Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?

No, and-matches are notoriously hard to code in a single regex.

I'd find all matches of any of the regexes and then see if the substring
of given length starting there contains the others. Something like this
(tested, but not debugged):

sub match_all_near {
for ( shift ) { # document text in $_
my ( $width, @regex) = @_;
my $any = join '|', @regex;
while ( /$any/g ) {
my $where = $-[ 0];
for ( substr( $_, $where, $width) ) { # substring in $_
my $match = 1;
for my $re ( @regex ) {
$match &&= /$re/;
last unless $match;
}
print "match at $where:\n$_\n" if $match;
}
}
}
}

The inner loop I simply checks all regexes again, even though it is
known that one already matched. It's not worth the hassle to skip it.

The way it is written it reports all matches, which includes overlapping
ones. If only (the first) success must be reported, that is no problem.
Otherwise, pos() could be set to the end of the substring after each
match.

It's basically linear in n (of regexes). For large n, the alternation
in /$any/ may dominate run time -- I'm not sure how alternations are
handled.

Anno
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
J Krugman
/$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N.

If you allow overlapping, then you can just AND the expressions:

'\A(?:' . join( '|', map ".*?$_", @rex ) . ')'

(untested; qr/./ may need to be //s'ed).
But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s.

If you want to match all them at a small distance (and allow
overlapping), you can have a "restricted AND" (drop \A, and restrict .*? ):

'(?:' . join( '|', map ".{0,64}?$_", @rex ) . ')'

(untested). I'm not sure whether this is sufficiently efficient for
your needs - but you should be able to calculate complexity yourselves...

Hope this helps,
Ilya
 
G

Greg Bacon

: [...]
: Of course, if there were a way to tell Perl to look for lines that
: matched $r1, $r2, and $r3 in *any* order, coding this would be
: relatively simple (though the results not necessarily faster). Is
: there any way to coax the regexp engine to handle such "commutative
: regexps"?

You can match in any order using positive lookahead assertions:

if (/(?=.*$r1)(?=.*$r2)(?=.*$r3)/) { ... }

What if you knew the starting and ending indices of a match?

sub startend {
my $pat = shift;
my $str = shift;

return unless $str =~ /$pat/gc;

# compute length($&) with $+[0] - $-[0] (see perlvar)
# to avoid even mentioning $& and to avoid adding
# parentheses that might throw off $pat's backrefs
(pos($str) - $+[0] + $-[0], pos($str) - 1);
}

I don't know if it's possible to find where a lookahead match started
and stopped.

Hope this helps,
Greg
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

Change array shapes 1
MDP Modeling in GAMS 1
Distribution 6
Simple Processor VHDL Doubt 0
splitting an XML file on the basis on basis of XML tags 14
Java matrix problem 3
Data Register Block 0
Help please 8

Members online

Forum statistics

Threads
473,764
Messages
2,569,567
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top