Problem with memory usage in pattern match

Niall Macpherson · Dec 5, 2005

I need to read a line which contains a [ and a ] and then capture the
data between these symbols. I then need to split the captured data
using the open parenthesis as the separator.

The following code does exactly what I want:

use strict;
use warnings;

while(<DATA>)
{
    if(/.*\[(.*)\]/)
    {
        my $second_bit = $1;
        while ( $second_bit =~ /
                    \(          # match open paren
                    ([^\(]*)    # match anything up to next opening paren
                /gx )
        {
            print $1, "\n";
        }
    }
    else
    {
        print "\n", 'No match';
    }
}
__END__
(lose this stuff[(process(this(stuff(bit(by(bit]

And it produces the following output:

H:\Perl Scripts>splitnoarray.pl
process
this
stuff
bit
by
bit

H:\Perl Scripts>

Which is fine for a short line of input.
However, the real data is very large: each line may be up to 1 GB, and
both memory usage and performance are terrible.

I had a look at getting rid of the '$second_bit' variable to avoid
doing a very large assignment, but if I use $1 directly in the while
loop I get the following:

H:\Perl Scripts>splitnoarray.pl
process

H:\Perl Scripts>

Can anyone suggest a better way of doing this which would minimise
memory usage?

Thanks in advance

Niall

xhoster · Dec 5, 2005

Niall Macpherson said:
I need to read a line which contains a [ and a ] and then capture the
data between these symbols. I then need to split the captured data
using the open parenthesis as the separator.

If you need to split it, why not "split" it? How many "(" do you expect?

The following code does exactly what I want

use strict;
use warnings;

while(<DATA>)
{
if(/.*\[(.*)\]/)

The captured data will span from the last [ which has a ] after it to the
last ]. Is that necessary? What if you just spanned from the first [ to
the first ] after it? Would that still be correct? (I.e., can your data
have nested []?)
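
To illustrate the difference between the two spans, here is a minimal sketch. The sample line is made up for the illustration (it is not from the real data), and it assumes the brackets are not nested.

```perl
use strict;
use warnings;

# Hypothetical sample line with two bracketed sections.
my $line = 'x[(a(b]y[(c(d]z';

# Greedy: the leading .* pushes the match to the last "[" that is
# still followed by a "]".
my ($last) = $line =~ /.*\[(.*)\]/;

# First "[" up to the first "]" after it; [^\]]* cannot run past a "]".
my ($first) = $line =~ /\[([^\]]*)\]/;

print "greedy: $last\n";     # (c(d
print "first:  $first\n";    # (a(b
```

The second form also avoids the initial .* running to the end of the line and backtracking, which may matter on very long lines.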

{
    my $second_bit = $1;
    while ( $second_bit =~ /
                \(          # match open paren
                ([^\(]*)    # match anything up to next opening paren
            /gx ) {
        print $1, "\n";
    }

##Maybe this would be better (or maybe not)

foreach ( split /\(/, $1) {

}
else
{
print "\n", 'No match';
}
}
__END__
(lose this stuff[(process(this(stuff(bit(by(bit]

And produces the following output

H:\Perl Scripts>splitnoarray.pl
process
this
stuff
bit
by
bit

H:\Perl Scripts>

Which is fine for a short line of input.
However, the real data is very large: each line may be up to 1 GB, and
both memory usage and performance are terrible.

I certainly wouldn't expect otherwise, operating on 1 GB strings.

I had a look at getting rid of the '$second_bit' variable to avoid
doing a very large assignment but if I use $1 directly in the while
loop I get the following

H:\Perl Scripts>splitnoarray.pl
process

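That is because both of your regexes are fighting over the same $1: as
soon as the inner match succeeds, it overwrites $1 with "process", and the
next pass of the loop finds no "(" in that. Splitting $1 rather than
operating on it with a capturing regex might be better.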
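
A minimal sketch of the split variant, using the sample line from the original post. The names $line and @pieces are only for the illustration. Note that split still builds the full list of pieces in memory, so this avoids the copy into $second_bit but is not a fix for 1 GB lines.

```perl
use strict;
use warnings;

my $line = '(lose this stuff[(process(this(stuff(bit(by(bit]';

my @pieces;
if ( $line =~ /.*\[(.*)\]/ ) {
    # split has no capture groups here, so it leaves $1 alone.
    # The captured text starts with "(", so the first field is empty;
    # grep drops it.
    @pieces = grep { length } split /\(/, $1;
}

print "$_\n" for @pieces;
```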

H:\Perl Scripts>

Can anyone suggest a better way of doing this which would minimise the
memory usage ?

I don't know if it would work in your exact situation, but I might try
something like this:

$/ = '[';
<$fh>;                  # skip to the start of what I want
$/ = '(';
while (<$fh>) {
    chomp;              # drop the trailing "(" that ends each record
    my ($stuff, $rest) = split /(\])/;
    print "$stuff\n" if defined $stuff and length $stuff;
    last if defined $rest;
}

(Although ideally I would reconsider whatever process it is that is making
this abominable file in the first place)
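
For anyone who wants to try the record-separator idea without a real file, here is a self-contained sketch. It reads from an in-memory filehandle standing in for the real data, and it chomps each record so the trailing "(" is not printed. It assumes a single bracketed section, and that each piece between two "(" fits comfortably in memory; $data and @pieces are names made up for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real file: an in-memory filehandle.
my $data = "(lose this stuff[(process(this(stuff(bit(by(bit]\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @pieces;
{
    local $/ = '[';         # first record: everything up to the "["
    my $skipped = <$fh>;    # read and discard it

    $/ = '(';               # from here on, each record ends at a "("
    while ( my $record = <$fh> ) {
        chomp $record;      # remove the trailing "(" if there is one
        my ( $stuff, $rest ) = split /(\])/, $record;
        push @pieces, $stuff if defined $stuff and length $stuff;
        last if defined $rest;    # a "]" was seen, so we are done
    }
}
close $fh;

print "$_\n" for @pieces;
```

Only one record is held at a time, so memory use depends on the longest piece rather than on the length of the line.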

Xho

Anno Siegel · Dec 9, 2005

Niall Macpherson said:
I need to read a line which contains a [ and a ] and then capture the
data between these symbols....

[solution snipped]

Which is fine for a short line of input.
However, the real data is very large: each line may be up to 1 GB, and
both memory usage and performance are terrible.
[...]

Can anyone suggest a better way of doing this which would minimise the
memory usage ?

It's been a while, but I find this problem intriguing.

If the whole string can't be read into memory, we must use a buffer. The
problem then is that buffer boundaries can fall anywhere relative to the
bracketed portions of the string, so a portion may be split across buffers.

Here is my take:

use constant SIZE => 5; # buffer size
my ( $open, $close) = map quotemeta, qw( [ ]); # delimiters

my (
    @res,    # result array to hold the bracketed portions
    $carry,  # bracketed portions whose end hasn't been reached
    $inside, # flag, set if the last buffer ended inside a pair of []
);

while ( read( DATA, $_, SIZE) ) {

    # deal with initial part if inside brackets
    if ( $inside ) {
        if ( /(.*?)$close/g ) {
            # finished with this pair of []
            push @res, $carry . $1; # put result away
            $inside = 0;            # no longer inside
        }
        else {
            # whole buffer goes to $carry
            $carry .= $_;
            pos = length; # done with this buffer, still inside
        }
    }

    # collect contents of complete pairs of [...] in this buffer
    my $pos = pos;
    while ( /$open(.*?)$close/g ) {
        push @res, $1;
        $pos = pos;
    }
    pos = $pos;

    # does another (unfinished) pair of [] start in this buffer?
    if ( /$open(.*)/g ) {
        $carry = $1;
        $inside = 1;
    }
}

print "$_\n" for @res;

__DATA__
[j]outside[inside_1]out again[inside 2]out once more[k][l]x[m]boo

This program never holds more than a buffer full of raw data and
works down to a buffer size of 1. It uses $_ as the buffer and
the pos() function to keep track of where in the buffer we stand.
The latter takes a bit of fiddling; there may be better ways.
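
The same buffering idea can be extended to the original task, which also splits the bracketed text on "(". What follows is a rough sketch, not a drop-in replacement for the program above: it tokenises each buffer into delimiters and runs of other characters, and hands each piece to a callback as soon as it is complete. The names stream_pieces and $emit are made up for the example, and it assumes the brackets are not nested.

```perl
use strict;
use warnings;

# Call $emit once for every non-empty piece found between "[" and "]",
# where pieces are separated by "(". Reads $chunk_size bytes at a time.
sub stream_pieces {
    my ( $fh, $chunk_size, $emit ) = @_;
    my $inside = 0;     # true while between "[" and "]"
    my $piece  = '';    # current piece, which may span several chunks

    while ( read( $fh, my $buf, $chunk_size ) ) {
        # Each token is either one delimiter or a run of non-delimiters.
        for my $token ( $buf =~ /[\[\]\(]|[^\[\]\(]+/g ) {
            if ( !$inside ) {
                $inside = 1 if $token eq '[';
            }
            elsif ( $token eq '(' or $token eq ']' ) {
                $emit->($piece) if length $piece;
                $piece  = '';
                $inside = 0 if $token eq ']';
            }
            else {
                $piece .= $token;    # text inside the brackets
            }
        }
    }
    return;
}

# Usage with an in-memory filehandle and a deliberately tiny buffer.
my $data = "(lose this stuff[(process(this(stuff(bit(by(bit]\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @pieces;
stream_pieces( $fh, 5, sub { push @pieces, $_[0] } );
close $fh;

print "$_\n" for @pieces;
```

With a callback that prints instead of pushing onto an array, only one buffer and one partial piece are held at any time.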

Anno
