Problem with memory usage in pattern match

  • Thread starter Niall Macpherson

Niall Macpherson

I need to read a line which contains a [ and a ] and then capture the
data between these symbols. I then need to split the captured data
using the open parenthesis as the separator.

The following code does exactly what I want

use strict;
use warnings;

while (<DATA>)
{
    if (/.*\[(.*)\]/)
    {
        my $second_bit = $1;
        while ( $second_bit =~ /
            \(        # match open paren
            ([^\(]*)  # match anything up to next opening paren
            /gx )
        {
            print $1, "\n";
        }
    }
    else
    {
        print "\n", 'No match';
    }
}
__END__
(lose this stuff[(process(this(stuff(bit(by(bit]

And produces the following output

H:\Perl Scripts>splitnoarray.pl
process
this
stuff
bit
by
bit

H:\Perl Scripts>

Which is fine for a short line of input.
However, the real data is very large: each line may be up to 1 GB, and
the memory usage and performance are terrible.

I had a look at getting rid of the '$second_bit' variable to avoid
doing a very large assignment but if I use $1 directly in the while
loop I get the following

H:\Perl Scripts>splitnoarray.pl
process

H:\Perl Scripts>

Can anyone suggest a better way of doing this which would minimise the
memory usage ?

Thanks in advance

Niall
 

xhoster

Niall Macpherson said:
I need to read a line which contains a [ and a ] and then capture the
data between these symbols. I then need to split the captured data
using the open parenthesis as the separator.

If you need to split it, why not "split" it? How many "(" do you expect?

The following code does exactly what I want

use strict;
use warnings;

while(<DATA>)
{
if(/.*\[(.*)\]/)

The captured data will span from the last [ which has a ] after it to the
last ]. Is that necessary? What if you just spanned from the first [ to
the first ] after it? Would that still be correct? (I.e. can your data
have nested [])?
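
To make the difference between the two spans concrete, here is a small sketch. The sample string and variable names are made up for illustration, and the negated character class is just one common way to stop at the first ], assuming the data has no nested brackets.

```perl
use strict;
use warnings;

# Hypothetical sample with two bracketed groups on one line.
my $line = 'x[(a(b]y[(c(d]z';

# Greedy: the leading .* runs to the end of the line and backtracks,
# so the capture starts at the last [ that has a ] after it.
my ($greedy) = $line =~ /.*\[(.*)\]/;

# First pair: no leading .*, and [^\]]* cannot run past a ].
my ($first) = $line =~ /\[([^\]]*)\]/;

print "greedy: $greedy\n";   # greedy: (c(d
print "first:  $first\n";    # first:  (a(b
```

On a very long line the second form also avoids the initial scan to the end of the string and the backtracking that follows it.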
{
my $second_bit = $1;
while ( $second_bit =~ /
\( # match open paren
([^\(]*) # match anything up to next opening paren
/gx )
{
print $1, "\n";
}


##Maybe this would be better (or maybe not)

foreach ( split /\(/, $1) {

}
else
{
print "\n", 'No match';
}
}
__END__
(lose this stuff[(process(this(stuff(bit(by(bit]

And produces the following output

H:\Perl Scripts>splitnoarray.pl
process
this
stuff
bit
by
bit

H:\Perl Scripts>

Which is fine for a short line of input.
However the real data is very large - each line may be up to 1Gb and
the memory usage and performance is terrible.

I certainly wouldn't expect otherwise, operating on 1 GB strings.

I had a look at getting rid of the '$second_bit' variable to avoid
doing a very large assignment but if I use $1 directly in the while
loop I get the following

H:\Perl Scripts>splitnoarray.pl
process

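That is because both of your regexes are fighting over the same $1. The
inner match overwrites $1 with "process" on its first success, so the next
iteration searches that short string, finds no "(" and stops. Trying to
split $1 rather than operating on it with a capturing regex might be
better.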
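
A minimal sketch of the split idea, using the sample line from the original post. The variable names are illustrative. The pattern given to split has no capture groups, so it does not compete with the outer match for $1.

```perl
use strict;
use warnings;

my $line = '(lose this stuff[(process(this(stuff(bit(by(bit]';

my @pieces;
if ( $line =~ /.*\[(.*)\]/ ) {
    # The text before the first ( comes back as an empty leading
    # field; grep drops it.
    @pieces = grep { length } split /\(/, $1;
}
print "$_\n" for @pieces;
```

Note that split still builds the whole list in memory, so this deals with the $1 clash rather than with the memory footprint.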
H:\Perl Scripts>

Can anyone suggest a better way of doing this which would minimise the
memory usage ?

I don't know if it would work in your exact situation, but I might try
something like this:

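$/ = '[';
<$fh>; # skip to the start of what I want
$/ = '(';
while (<$fh>) {
    chomp; # strip the trailing "(" record separator
    my ($stuff, $rest) = split /(\])/;
    print "$stuff\n" if defined $stuff and length $stuff;
    last if defined $rest; # a ] was seen, so we are done
}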
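
For anyone who wants to try the record-separator idea without a real file, here is a self-contained sketch. It assumes an in-memory filehandle as a stand-in for the real data, collects the pieces in a hypothetical @pieces array, and handles only the first [...] section. It is an illustration under those assumptions, not a tested solution for the 1 GB case.

```perl
use strict;
use warnings;

# Stand-in for the real file: an in-memory filehandle over sample data.
my $data = "(lose this stuff[(process(this(stuff(bit(by(bit]\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @pieces;
{
    local $/ = '[';   # first record: everything up to and including the [
    <$fh>;            # read and discard it
    $/ = '(';         # from here on, each record ends at the next (
    while ( my $rec = <$fh> ) {
        chomp $rec;   # strip the trailing ( if present
        my ( $stuff, $rest ) = split /(\])/, $rec;
        push @pieces, $stuff if defined $stuff && length $stuff;
        last if defined $rest;   # reached the closing ]
    }
}
close $fh;

print "$_\n" for @pieces;
```

Because each read stops at the next "(", only one piece is held at a time, provided the pieces themselves are small.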


(Although ideally I would reconsider whatever process it is that is making
this abominable file in the first place)

Xho
 

Anno Siegel

Niall Macpherson said:
I need to read a line which contains a [ and a ] and then capture the
data between these symbols....

[solution snipped]
Which is fine for a short line of input.
However the real data is very large - each line may be up to 1Gb and
the memory usage and performance is terrible.
[...]

Can anyone suggest a better way of doing this which would minimise the
memory usage ?

It's been a while, but I find this problem intriguing.

If the whole string can't be read into memory we must use a buffer. The
problem then is that buffer boundaries can fall anywhere relative to the
bracketed portions of the string.

Here is my take:

use constant SIZE => 5; # buffer size
my ( $open, $close ) = map quotemeta, qw( [ ] ); # delimiters

my (
    @res,    # result array to hold the bracketed portions
    $carry,  # bracketed portions whose end hasn't been reached
    $inside, # flag, set if the last buffer ended inside a pair of []
);

while ( read( DATA, $_, SIZE ) ) {

    # deal with initial part if inside brackets
    if ( $inside ) {
        if ( /(.*?)$close/g ) {
            # finished with this pair of []
            push @res, $carry . $1; # put result away
            $inside = 0;            # no longer inside
        }
        else {
            # whole buffer goes to $carry
            $carry .= $_;
            pos = length; # done with this buffer, still inside
        }
    }

    # collect contents of complete pairs of [...] in this buffer
    my $pos = pos;
    while ( /$open(.*?)$close/g ) {
        push @res, $1;
        $pos = pos;
    }
    pos = $pos;

    # does another (unfinished) pair of [] start in this buffer?
    if ( /$open(.*)/g ) {
        $carry = $1;
        $inside = 1;
    }
}

print "$_\n" for @res;


__DATA__
[j]outside[inside_1]out again[inside 2]out once more[k][l]x[m]boo


This program never reads more than one buffer of raw data at a time
(the bracketed contents collected in $carry and @res still accumulate)
and works down to a buffer size of 1. It uses $_ as the buffer and
the pos() function to keep track of where in the buffer we stand.
The latter takes a bit of fiddling; there may be better ways.
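
As one possible variation (a sketch, not the program above), the same chunked reading can hand each "("-separated piece to a callback as soon as it is complete, so nothing larger than one chunk plus the current piece is kept. The stream_pieces helper, its parameters and the in-memory filehandle are all hypothetical. It handles only the first [...] section, assumes no nesting, and silently drops an unfinished piece if the closing ] never arrives.

```perl
use strict;
use warnings;

# Hypothetical helper: stream the first [...] section of $fh in chunks of
# $size bytes and call $emit->($piece) for each non-empty piece.
sub stream_pieces {
    my ( $fh, $size, $emit ) = @_;
    my $inside = 0;    # true once the opening [ has been seen
    my $piece  = '';   # current piece; may span chunk boundaries

    while ( read( $fh, my $chunk, $size ) ) {
        if ( !$inside ) {
            my $at = index $chunk, '[';
            next if $at < 0;                     # still before the [
            $inside = 1;
            substr( $chunk, 0, $at + 1 ) = '';   # drop everything through the [
        }
        # Walk the chunk as runs of plain text, each followed by ( or ].
        while ( $chunk =~ /\G([^(\]]*)([(\]]?)/gc ) {
            $piece .= $1;
            last if $2 eq '';                    # chunk exhausted, keep $piece
            $emit->($piece) if length $piece;
            $piece = '';
            return if $2 eq ']';                 # closing bracket: done
        }
    }
    return;
}

# Usage with the sample line from the original post and a 5-byte buffer.
my $data = "(lose this stuff[(process(this(stuff(bit(by(bit]\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @seen;
stream_pieces( $fh, 5, sub { push @seen, $_[0]; print "$_[0]\n" } );
close $fh;
```

Keeping a small state (the $inside flag and the partial $piece) across reads is what lets the chunk boundaries fall anywhere, including in the middle of a piece.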

Anno
 
