"Commutative" regexps?

J Krugman · May 28, 2004

Suppose that I have three different regexps, for example:

my $r1 = qr/foo/;
my $r2 = qr/bar/;
my $r3 = qr/baz/;

(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:

for my $doc (@docs) {
if (/$r1/s && /$r2/s && /$r3/s) {
go_to_town($_);
}
}

But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions

/$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).

Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?

Any suggestions on this would be much appreciated.

jill

Anno Siegel · May 28, 2004

J Krugman said:
Suppose that I have three different regexps, for example:

my $r1 = qr/foo/;
my $r2 = qr/bar/;
my $r3 = qr/baz/;

(although not necessarily this simple). If I wanted to find all
the documents in an array of documents that matched the three
regexps, I'd just do:

for my $doc (@docs) {
if (/$r1/s && /$r2/s && /$r3/s) {
go_to_town($_);
}
}

But what if I wanted to impose a constraint on how far any two
regexps can be from each other (e.g. for implementing a NEAR query
operator). In that case I'd need expressions

/$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N. But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s. This approach is linear in N. The
problem is that I can't figure out how to code it in a way that
preserves this linear property. The first obstacle is how to
determine which of the regexps in a disjunction (e.g. /($r1|$r2|$r3)/)
actually matched, without having to perform O(N) tests (which would
put me back in O(N^2) territory for the whole algorithm).

Of course, if there were a way to tell Perl to look for lines that
matched $r1, $r2, and $r3 in *any* order, coding this would be
relatively simple (though the results not necessarily faster). Is
there any way to coax the regexp engine to handle such "commutative
regexps"?

No, and-matches are notoriously hard to code in a single regex.

I'd find all matches of any of the regexes and then see if the substring
of given length starting there contains the others. Something like this
(tested, but not debugged):

sub match_all_near {
for ( shift ) { # document text in $_
my ( $width, @regex) = @_;
my $any = join '|', @regex;
while ( /$any/g ) {
my $where = $-[ 0];
for ( substr( $_, $where, $width) ) { # substring in $_
my $match = 1;
for my $re ( @regex ) {
$match &&= /$re/;
last unless $match;
}
print "match at $where:\n$_\n" if $match;
}
}
}
}

The inner loop I simply checks all regexes again, even though it is
known that one already matched. It's not worth the hassle to skip it.

The way it is written it reports all matches, which includes overlapping
ones. If only (the first) success must be reported, that is no problem.
Otherwise, pos() could be set to the end of the substring after each
match.

It's basically linear in n (of regexes). For large n, the alternation
in /$any/ may dominate run time -- I'm not sure how alternations are
handled.

Anno

Ilya Zakharevich · May 28, 2004

[A complimentary Cc of this posting was sent to
J Krugman

/$r1(.*?)$r2/s

and test for the length of $1. The problem is that now I'd need
6 tests like this like. If instead of 3 regexps we had N, the
number of such tests required would be N*(N-1)/2, i.e. quadratic
growth in N.

If you allow overlapping, then you can just AND the expressions:

'\A(?:' . join( '|', map ".*?$_", @rex ) . ')'

(untested; qr/./ may need to be //s'ed).

But this would be a very inefficient solution. After
all, after /($r1|$r2|$r3)/s matched, and say the match was $r2, we
know that we are interested only in the distance between this match
and the next match of /($r1|$r3)/s. And once this matched, say to
$r3, we are only interested in the distance between this match and
the next match to /$r1/s.

If you want to match all them at a small distance (and allow
overlapping), you can have a "restricted AND" (drop \A, and restrict .*? ):

'(?:' . join( '|', map ".{0,64}?$_", @rex ) . ')'

(untested). I'm not sure whether this is sufficiently efficient for
your needs - but you should be able to calculate complexity yourselves...

Hope this helps,
Ilya

Greg Bacon · May 28, 2004

: [...]
: Of course, if there were a way to tell Perl to look for lines that
: matched $r1, $r2, and $r3 in *any* order, coding this would be
: relatively simple (though the results not necessarily faster). Is
: there any way to coax the regexp engine to handle such "commutative
: regexps"?

You can match in any order using positive lookahead assertions:

if (/(?=.*$r1)(?=.*$r2)(?=.*$r3)/) { ... }

What if you knew the starting and ending indices of a match?

sub startend {
my $pat = shift;
my $str = shift;

return unless $str =~ /$pat/gc;

# compute length($&) with $+[0] - $-[0] (see perlvar)
# to avoid even mentioning $& and to avoid adding
# parentheses that might throw off $pat's backrefs
(pos($str) - $+[0] + $-[0], pos($str) - 1);
}

I don't know if it's possible to find where a lookahead match started
and stopped.

Hope this helps,
Greg

Change array shapes	1	Oct 28, 2022
MDP Modeling in GAMS	1	Mar 10, 2023
Distribution	6	Mar 20, 2012
Simple Processor VHDL Doubt	0	May 24, 2011
splitting an XML file on the basis on basis of XML tags	14	Apr 2, 2008
Java matrix problem	3	Sep 10, 2023
Data Register Block	0	Mar 7, 2009
Help please	8	Jul 7, 2023

"Commutative" regexps?

J Krugman

Anno Siegel

Ilya Zakharevich

Greg Bacon

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads