Matching multiple subexpressions in a regular expression

ShaunJ · Mar 11, 2008

If more than one subepxression of a regex could have matched the
string, is it possible to find all the subexpressions that could have
matched? Example...

my $re = qr/([AC]CT)|(AC[CT])/;
'CCT' =~ m/$re/;
print join ',', @-; print "\n";
'ACC' =~ m/$re/;
print join ',', @-; print "\n";
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0
0,,0
0,0

This shows that for...
CCT: the first subexpression matched
ACC: the second subexpression matched
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched. Is this possible without
having to split each subexpression into its own regular expression? My
understanding -- please correct me if I'm wrong -- is that one big
regular expression will run faster than 100 little ones, since the one
big regular expression can be compiled into a single large finite-
state-machine that is more efficient than running 100 little FSM.

Thanks!
Shaun

John W. Krahn · Mar 12, 2008

ShaunJ said:
If more than one subepxression of a regex could have matched the
string, is it possible to find all the subexpressions that could have
matched? Example...

my $re = qr/([AC]CT)|(AC[CT])/;
'CCT' =~ m/$re/;
print join ',', @-; print "\n";
'ACC' =~ m/$re/;
print join ',', @-; print "\n";
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0
0,,0
0,0

This shows that for...
CCT: the first subexpression matched
ACC: the second subexpression matched
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched.

Perl's regular expressions can't do that. They always stop after a
successful match so either ([AC]CT) would match or (AC[CT]) would match
but never both.

Is this possible without
having to split each subexpression into its own regular expression? My
understanding -- please correct me if I'm wrong -- is that one big
regular expression will run faster than 100 little ones,

Your assumption is incorrect.

since the one
big regular expression can be compiled into a single large finite-
state-machine that is more efficient than running 100 little FSM.

That question is answered in perlfaq6:

perldoc -q "How do I efficiently match many regular expressions at once"

John

ShaunJ · Mar 12, 2008

On Mar 11 said:
That question is answered in perlfaq6:

perldoc -q "How do I efficiently match many regular expressions at once"

Hi John,

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. The
machine has 2 GB of memory and free disk space, which should be
enough, so I presume the code is somehow leeking memory. It's only a
dozen or so lines long, so I've posted my code below. Can you see an
obvious leak?

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
my $i = 0;
foreach my $re (@re) {
$i++;
pos = 0;
while (m/$re/g) {
print $i, "\t",
$LAST_MATCH_START[0] + 1, "\t",
$&, "\n";
pos = $LAST_MATCH_START[0] + 1;
}
}
}

Ilya Zakharevich · Mar 12, 2008

[A complimentary Cc of this posting was sent to
John W. Krahn

my $re = qr/([AC]CT)|(AC[CT])/;
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0

This shows that for...
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched.

Click to expand...

Perl's regular expressions can't do that. They always stop after a
successful match so either ([AC]CT) would match or (AC[CT]) would match
but never both.

Perl's regular expressions can do it. You just follow the match by
(?!), and preceed it by (??{code}) which analizes and stores the match
results.

Hope this helps,
Ilya

ShaunJ · Mar 13, 2008

ShaunJ said:
ShaunJ said:

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. [...]
my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;

Click to expand...

400'000 precompiled regexes are quite a lot!
Why don't you read and create them in chunks of, say, 1000?

Hi Frank,

400'000 * 27 = 12 MB. I wouldn't say it's that large. It seems
reasonable that the compiled regex would fit in less than one GB of
memory, which is my only requirement.

In any case, the following 9-line code snippet burns through 100MB of
memory a second using Perl 5.8.6! Certainly a memory leak. The only
explanation I can think of is if the m/$re/g expression were
recompiling the regex every time and the previously compiled regex
weren't being discarded.

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
foreach my $re (@re) {
while (m/$re/g) {
print $LAST_MATCH_START[0], "\n";
}
}
}

ShaunJ · Mar 13, 2008

On Mar 13 said:
No, Perl precompiles the patterns (your 400'000 regexes)
into an internal representation at the moment of qr//,
that is at the beginning of your program.

That is my intention. A quick experiment with `top` shows that the
400'000 regex use 500 MB of memory once they're compiled, which is
fine by me. It's the following loop that leaks memory like crazy, and
it shouldn't use any additional memory. Any ideas why?

Swapping the loops won't have any effect, as there's only one string
(one line in the input file) for the first while(<>) loop.

Cheers,
Shaun

xhoster · Mar 13, 2008

ShaunJ said:
Hi John,

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. The
machine has 2 GB of memory and free disk space, which should be
enough, so I presume the code is somehow leeking memory. It's only a
dozen or so lines long, so I've posted my code below. Can you see an
obvious leak?

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {

....

Can you produce a version that we can run? (I.e. that doesn't
depend on REFILE or STDIN, which we don't have access to?)

The below doesn't leak on v5.8.3, 5.8.7, or 5.8.8.

use strict;
use warnings;
use English;

my @re = map { qr/$_/x } '000000'..'400000';
push @re, qr/\d/; #just to make sure something matches

foreach
('000000000000000000000000000000'..'000000000000000000000001000000') {
print "$_\n";
my $i = 0;
foreach my $re (@re) {
$i++;
pos = 0;
while (m/$re/g) {
print $i, "\t",
$LAST_MATCH_START[0] + 1, "\t",
$&, "\n";
pos = $LAST_MATCH_START[0] + 1;
}
}
}
__END__

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

ShaunJ · Mar 13, 2008

...

Can you produce a version that we can run? (I.e. that doesn't
depend on REFILE or STDIN, which we don't have access to?)

Yes, see the recent thread 'm// on very long lines leaks memory',
where I gave a small test case. As it turns out, there is a bug in
Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
either English or $& causes the memory leak. This bug is fixed in
5.10.0.

Cheers,
Shaun

Uri Guttman · Mar 13, 2008

S> Yes, see the recent thread 'm// on very long lines leaks memory',
S> where I gave a small test case. As it turns out, there is a bug in
S> Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
S> either English or $& causes the memory leak. This bug is fixed in
S> 5.10.0.

it is not a leak (as someone else proved in another post). so don't go
blabbing that it is a leak. it is a well known ram suck but it doesn't
lose the ram like a true leak does.

uri

Dr.Ruud · Mar 16, 2008

ShaunJ schreef:

As it turns out, there is a bug in
Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
either English or $& causes the memory leak. This bug is fixed in
5.10.0.

It is neither a leak nor a bug. Read perldoc perlre.

Ilya Zakharevich · Mar 16, 2008

[A complimentary Cc of this posting was NOT [per weedlist] sent to
Dr.Ruud

ShaunJ schreef:

It is neither a leak nor a bug. Read perldoc perlre.

If you think it is not a bug, please explain what is the purpose of
the stored information.

Thanks,
Ilya

Recursion regular expression (xtended)	1	Aug 16, 2010
How do I get the text that is found by a regular expression?	10	Apr 30, 2014
Regular expression problem	13	Mar 10, 2013
FAQ 6.20 What good is "\G" in a regular expression?	0	Mar 3, 2011
FAQ 6.5 I put a regular expression into $/ but it didn't work. What's wrong?	0	Jan 28, 2011
Repeating assertions in regular expression	3	Jan 3, 2012
FAQ 6.24 How do I match a regular expression that's in a variable?	0	Apr 19, 2011
Problem with a regular expression	5	Oct 13, 2006

Matching multiple subexpressions in a regular expression

ShaunJ

John W. Krahn

ShaunJ

Ilya Zakharevich

ShaunJ

ShaunJ

xhoster

ShaunJ

Uri Guttman

Dr.Ruud

Ilya Zakharevich

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads