J
J Krugman
Suppose that I have three different regexps, for example:
my $r1 = qr/foo/;
my $r2 = qr/bar/;
my $r3 = qr/baz/;
(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:
for my $doc (@docs) {
if (/$r1/s && /$r2/s && /$r3/s) {
go_to_town($_);
}
}
But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions
/$r1(.*?)$r2/s
and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).
Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?
Any suggestions on this would be much appreciated.
jill
my $r1 = qr/foo/;
my $r2 = qr/bar/;
my $r3 = qr/baz/;
(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:
for my $doc (@docs) {
if (/$r1/s && /$r2/s && /$r3/s) {
go_to_town($_);
}
}
But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions
/$r1(.*?)$r2/s
and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).
Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?
Any suggestions on this would be much appreciated.
jill