Matching multiple subexpressions in a regular expression

S

ShaunJ

If more than one subepxression of a regex could have matched the
string, is it possible to find all the subexpressions that could have
matched? Example...

my $re = qr/([AC]CT)|(AC[CT])/;
'CCT' =~ m/$re/;
print join ',', @-; print "\n";
'ACC' =~ m/$re/;
print join ',', @-; print "\n";
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0
0,,0
0,0

This shows that for...
CCT: the first subexpression matched
ACC: the second subexpression matched
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched. Is this possible without
having to split each subexpression into its own regular expression? My
understanding -- please correct me if I'm wrong -- is that one big
regular expression will run faster than 100 little ones, since the one
big regular expression can be compiled into a single large finite-
state-machine that is more efficient than running 100 little FSM.

Thanks!
Shaun
 
J

John W. Krahn

ShaunJ said:
If more than one subepxression of a regex could have matched the
string, is it possible to find all the subexpressions that could have
matched? Example...

my $re = qr/([AC]CT)|(AC[CT])/;
'CCT' =~ m/$re/;
print join ',', @-; print "\n";
'ACC' =~ m/$re/;
print join ',', @-; print "\n";
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0
0,,0
0,0

This shows that for...
CCT: the first subexpression matched
ACC: the second subexpression matched
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched.

Perl's regular expressions can't do that. They always stop after a
successful match so either ([AC]CT) would match or (AC[CT]) would match
but never both.
Is this possible without
having to split each subexpression into its own regular expression? My
understanding -- please correct me if I'm wrong -- is that one big
regular expression will run faster than 100 little ones,

Your assumption is incorrect.
since the one
big regular expression can be compiled into a single large finite-
state-machine that is more efficient than running 100 little FSM.

That question is answered in perlfaq6:

perldoc -q "How do I efficiently match many regular expressions at once"



John
 
S

ShaunJ

That question is answered in perlfaq6:

perldoc -q "How do I efficiently match many regular expressions at once"

Hi John,

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. The
machine has 2 GB of memory and free disk space, which should be
enough, so I presume the code is somehow leeking memory. It's only a
dozen or so lines long, so I've posted my code below. Can you see an
obvious leak?

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
my $i = 0;
foreach my $re (@re) {
$i++;
pos = 0;
while (m/$re/g) {
print $i, "\t",
$LAST_MATCH_START[0] + 1, "\t",
$&, "\n";
pos = $LAST_MATCH_START[0] + 1;
}
}
}
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
John W. Krahn
my $re = qr/([AC]CT)|(AC[CT])/;
'ACT' =~ m/$re/;
print join ',', @-; print "\n";

Output:
0,0

This shows that for...
ACT: the first subexpression matched

However, ACT matched both subexpressions! The ideal result for ACT
would be...
0,0,0
showing that both subexpressions matched.

Perl's regular expressions can't do that. They always stop after a
successful match so either ([AC]CT) would match or (AC[CT]) would match
but never both.

Perl's regular expressions can do it. You just follow the match by
(?!), and preceed it by (??{code}) which analizes and stores the match
results.

Hope this helps,
Ilya
 
S

ShaunJ

ShaunJ said:
If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. [...]
my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;

400'000 precompiled regexes are quite a lot!
Why don't you read and create them in chunks of, say, 1000?

Hi Frank,

400'000 * 27 = 12 MB. I wouldn't say it's that large. It seems
reasonable that the compiled regex would fit in less than one GB of
memory, which is my only requirement.

In any case, the following 9-line code snippet burns through 100MB of
memory a second using Perl 5.8.6! Certainly a memory leak. The only
explanation I can think of is if the m/$re/g expression were
recompiling the regex every time and the previously compiled regex
weren't being discarded.

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
foreach my $re (@re) {
while (m/$re/g) {
print $LAST_MATCH_START[0], "\n";
}
}
}
 
S

ShaunJ

No, Perl precompiles the patterns (your 400'000 regexes)
into an internal representation at the moment of qr//,
that is at the beginning of your program.

That is my intention. A quick experiment with `top` shows that the
400'000 regex use 500 MB of memory once they're compiled, which is
fine by me. It's the following loop that leaks memory like crazy, and
it shouldn't use any additional memory. Any ideas why?

Swapping the loops won't have any effect, as there's only one string
(one line in the input file) for the first while(<>) loop.

Cheers,
Shaun
 
X

xhoster

ShaunJ said:
Hi John,

If I structure my program as in the example, using many small regex
instead of one big regex, Perl 5.8.6 runs out of memory and dies:
vm_allocate failed, Out of memory! I have 400'000 regex of exactly 27
characters each, and the input string is one line 100 kB long. The
machine has 2 GB of memory and free disk space, which should be
enough, so I presume the code is somehow leeking memory. It's only a
dozen or so lines long, so I've posted my code below. Can you see an
obvious leak?

Thanks,
Shaun

my @restrings = <REFILE>;
my @re = map { qr/$_/x } @restrings;
while (<>) {
....

Can you produce a version that we can run? (I.e. that doesn't
depend on REFILE or STDIN, which we don't have access to?)

The below doesn't leak on v5.8.3, 5.8.7, or 5.8.8.

use strict;
use warnings;
use English;

my @re = map { qr/$_/x } '000000'..'400000';
push @re, qr/\d/; #just to make sure something matches

foreach
('000000000000000000000000000000'..'000000000000000000000001000000') {
print "$_\n";
my $i = 0;
foreach my $re (@re) {
$i++;
pos = 0;
while (m/$re/g) {
print $i, "\t",
$LAST_MATCH_START[0] + 1, "\t",
$&, "\n";
pos = $LAST_MATCH_START[0] + 1;
}
}
}
__END__

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
S

ShaunJ

...

Can you produce a version that we can run? (I.e. that doesn't
depend on REFILE or STDIN, which we don't have access to?)

Yes, see the recent thread 'm// on very long lines leaks memory',
where I gave a small test case. As it turns out, there is a bug in
Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
either English or $& causes the memory leak. This bug is fixed in
5.10.0.

Cheers,
Shaun
 
U

Uri Guttman

S> Yes, see the recent thread 'm// on very long lines leaks memory',
S> where I gave a small test case. As it turns out, there is a bug in
S> Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
S> either English or $& causes the memory leak. This bug is fixed in
S> 5.10.0.

it is not a leak (as someone else proved in another post). so don't go
blabbing that it is a leak. it is a well known ram suck but it doesn't
lose the ram like a true leak does.

uri
 
D

Dr.Ruud

ShaunJ schreef:
As it turns out, there is a bug in
Perl 5.8.6 (which is shipped with MacOSX 10.4.11 incidentally). Using
either English or $& causes the memory leak. This bug is fixed in
5.10.0.

It is neither a leak nor a bug. Read perldoc perlre.
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was NOT [per weedlist] sent to
Dr.Ruud
ShaunJ schreef:


It is neither a leak nor a bug. Read perldoc perlre.

If you think it is not a bug, please explain what is the purpose of
the stored information.

Thanks,
Ilya
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,044
Messages
2,570,388
Members
47,052
Latest member
ketan

Latest Threads

Top