Best practice for avoiding excessive memory usage?

Discussion in 'Perl Misc' started by Chris, Nov 17, 2006.

  1. Chris

    Chris Guest

    I've come across the Perl issue of inefficient use of memory when
    dealing with large datasets. What are people's opinions on the best way
    to work around this problem?

    e.g.

    My input file has this layout:
    # Input 1_8:
    0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    # Output 1_8:
    0 0 1
    # Input 1_9:
    0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    # Output 1_9:
    0 0 1
    # Input 1_10:
    0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    # Output 1_10:
    0 0 1

    There are ~73000 pairs of inputs and outputs, and the file is ~260MB in
    size. However, reading the file into an array with the following code
    snippet results in 1.2GB of memory usage:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my ($patfile) = @ARGV;

    open(my $FH, $patfile) or die;
    my @array;
    my $flag = 0;
    my $i = 0;

    while (<$FH>) {
        $flag = 0 if (/^\# Output/);
        $flag = 1 and next if (/^\# Input/);
        if ($flag) {
            chomp;
            print "$i\n";
            $array[$i] = [ split ];
            ++$i;
        }
    }
    exit;

    I've read about the various workarounds that access the array via a file
    on disk, but they don't seem to be very conducive to working with
    complex data structures. Can you guys/gals let me know your favourite
    methods for working more efficiently? At the moment I'm just
    reading/writing the files a bit at a time.
    TIA
    Chris, Nov 17, 2006
    #1

  2. Chris

    Guest

    Chris <> wrote:
    > I've come across the perl issue of inefficient use of memory when
    > dealing with large datasets. What are people's opinions on the best way
    > to work around this problem.


    That depends entirely on what you are trying to do with the data. You
    haven't shown us anything about what you are trying to do. The code you
    showed us does nothing but take memory and burn CPU cycles.

    > e.g.
    >
    > My input file has this layout:
    > # Input 1_8:
    > 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    > # Output 1_8:
    > 0 0 1
    > # Input 1_9:
    > 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    > # Output 1_9:
    > 0 0 1
    > # Input 1_10:
    > 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    > # Output 1_10:
    > 0 0 1
    >
    > With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    > However when reading the file into an array with the following code
    > snippet results in 1.2Gb of memory usage:
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > my ($patfile) = @ARGV;
    >
    > open(my $FH, $patfile) or die;
    > my @array;
    > my $flag = 0;
    > my $i = 0;
    >
    > while (<$FH>) {
    > $flag = 0 if (/^\# Output/);
    > $flag = 1 and next if (/^\# Input/);
    > if ($flag) {
    > chomp;
    > print "$i\n";
    > $array[$i] = [ split ];
    > ++$i;
    > }
    > }
    > exit;


    This program reads in data and does nothing with it. You may as well
    move the "exit" up to just before the "use strict;".

    >
    > I've read about the various work-arounds to access the array via a file
    > on disk,


    Which ones?

    > but they don't seem to be very conducive for working with
    > complex data structures.


    Why not? What problems did you encounter?

    > Can you guys/gals let me know of their
    > favourite method to work more efficiently as at the moment I'm just
    > reading/writing the files a bit at a time?


    Reading and writing the files a bit at a time is an efficient method.
    At least as far as memory is concerned.
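
    For illustration, here is a minimal sketch of that streaming approach for
    this file format: read one input/output pair at a time, hand it to a
    processing routine, and forget it before reading the next one. The
    process_pair() routine is a hypothetical placeholder for whatever the
    real program needs to do with each pair.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($patfile) = @ARGV;
    open(my $FH, '<', $patfile) or die "Cannot open '$patfile': $!";

    my $state = '';
    my (@input, @output);
    while (my $line = <$FH>) {
        chomp $line;
        if ($line =~ /^\# Input/) {
            # A new pair begins: process the previous one, then discard it.
            process_pair(\@input, \@output) if @input;
            @input  = ();
            @output = ();
            $state  = 'input';
        }
        elsif ($line =~ /^\# Output/) {
            $state = 'output';
        }
        elsif ($state eq 'input') {
            @input = split ' ', $line;
        }
        elsif ($state eq 'output') {
            @output = split ' ', $line;
        }
    }
    process_pair(\@input, \@output) if @input;    # the last pair in the file

    sub process_pair {
        my ($in, $out) = @_;
        # Placeholder: only the current pair is ever held in memory.
        printf "pair with %d inputs and %d outputs\n", scalar @$in, scalar @$out;
    }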

    Xho

    xhoster, Nov 17, 2006
    #2

  3. Chris

    Chris Guest

    xhoster wrote:

    > Chris <> wrote:
    >> I've come across the perl issue of inefficient use of memory when
    >> dealing with large datasets. What are people's opinions on the best
    >> way to work around this problem.

    >
    > That depends entirely on what you are trying to do with the data. You
    > haven't shown us anything about what you are trying to do. The code
    > you showed us does nothing but take memory and burn CPU cycles.


    Exactly. I was trying to give an example of the inefficient use of
    memory by Perl - nothing more and nothing less.
    [snip]

    >>
    >> I've read about the various work-arounds to access the array via a
    >> file on disk,

    >
    > Which ones?


    The ones in the FAQ: 'How can I make my Perl program take less memory?'

    >> but they don't seem to be very conducive for working with
    >> complex data structures.

    >
    > Why not? What problems did you encounter?


    AFAICS you can either store 1D arrays as lines in a file or use some
    sort of DB to manage the data. I may use these in the future, but at
    the moment I'm looking for a reasonably straightforward method to make
    an existing program more memory-efficient.

    >
    >> Can you guys/gals let me know of their
    >> favourite method to work more efficiently as at the moment I'm just
    >> reading/writing the files a bit at a time?

    >
    > Reading and writing the files a bit at a time is an efficient method.
    > At least as far as memory is concerned.
    >


    OK. That's what I'll do for the time being. However, I'm still
    interested in hearing how other people have overcome this problem.
    Thanks.
    Chris, Nov 17, 2006
    #3
  4. On Fri, 17 Nov 2006 15:43:48 +0000, xhoster wrote:

    >> Can you guys/gals let me know of their
    >> favourite method to work more efficiently as at the moment I'm just
    >> reading/writing the files a bit at a time?

    >
    > Reading and writing the files a bit at a time is an efficient method.
    > At least as far as memory is concerned.


    That is the best method.

    Others include:

    - Add more memory. 1.2GB of data usage is not that much, and memory is
    cheap.

    - Process the file in stages, producing intermediary results (and files)
    to make the next stage efficient.

    - Put the data in a database, optionally producing a new datafile from
    the database after processing (see the sketch below).
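
    For the database route, a minimal sketch using DBI with DBD::SQLite
    (assumed to be installed); the database file, table and column names are
    made up for illustration. Each pair is stored as its raw text lines, with
    only one pair held in memory at a time, and rows are split on demand when
    read back.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=patterns.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do("CREATE TABLE IF NOT EXISTS pairs
              (id INTEGER PRIMARY KEY, input TEXT, output TEXT)");
    my $ins = $dbh->prepare("INSERT INTO pairs (id, input, output) VALUES (?, ?, ?)");

    my ($id, $section, $input_line) = (0, '', '');
    while (my $line = <STDIN>) {               # e.g. pipe the 260MB file in
        chomp $line;
        if    ($line =~ /^\# Input/)  { $section = 'input'; $id++; }
        elsif ($line =~ /^\# Output/) { $section = 'output';       }
        elsif ($section eq 'input')   { $input_line = $line;       }
        elsif ($section eq 'output')  { $ins->execute($id, $input_line, $line); }
    }
    $dbh->commit;

    # Later: fetch one pair at a time and split only when needed.
    my ($in) = $dbh->selectrow_array("SELECT input FROM pairs WHERE id = ?", undef, 1);
    my @values = defined $in ? split(' ', $in) : ();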

    M4
    --
    Redundancy is a great way to introduce more single points of failure.
    Martijn Lievaart, Nov 17, 2006
    #4
  5. Chris

    Guest

    Chris <> wrote:
    > >
    > >> Can you guys/gals let me know of their
    > >> favourite method to work more efficiently as at the moment I'm just
    > >> reading/writing the files a bit at a time?

    > >
    > > Reading and writing the files a bit at a time is an efficient method.
    > > At least as far as memory is concerned.
    > >

    >
    > OK. That's what I'll do for the time being. However, I'm still
    > interested in hearing how other people have overcome this problem.


    I've used probably dozens of different methods to overcome the problem of
    excess memory use, but each one is suited to only specific kinds of
    problems. Changing algorithms so that you don't hold everything in memory
    at once. Using Perl to transform the problem into something that can be
    solved by the system sort routine. Changing languages to something more
    memory-efficient, either entirely, or using Inline, or just by using Perl
    to pre-process into a C-friendly format, then using C, then using Perl to
    post-process back into the desired format. Using DBM::Deep. Storing
    "records" as whole strings and splitting them on the fly when needed
    (occasionally using tied arrays or hashes to hide this fact).
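
    Of those, DBM::Deep is perhaps the most drop-in: it keeps nested Perl
    structures in a file and only pages in the pieces that are actually
    touched. A minimal sketch, assuming DBM::Deep is installed (the file name
    and structure here are illustrative, not from the original program):

    use strict;
    use warnings;
    use DBM::Deep;

    # All of the data lives in "patterns.db" on disk.
    my $db = DBM::Deep->new("patterns.db");
    $db->{pairs} ||= [];

    # Store a pair; nested references are written through to the file.
    push @{ $db->{pairs} }, {
        input  => [ 0.28496, 0.10340, 0.33403 ],   # values from the sample data
        output => [ 0, 0, 1 ],
    };

    # Read a single element back later without loading the whole array.
    my $first = $db->{pairs}[0];
    print "first output vector: @{ $first->{output} }\n";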

    Xho

    xhoster, Nov 17, 2006
    #5
  6. Chris

    Ted Zlatanov Guest

    On 17 Nov 2006, Chris wrote:

    > OK. That's what I'll do for the time being. However, I'm still
    > interested in hearing how other people have overcome this problem.


    As the size of your data grows, the solutions grow more complex too.
    Everyone knows how to manage data that is 1% of the system memory well.
    Few manage data that is 500% of the system memory well.

    Depending on your application you'll have to find the right solution.
    Usually you'll end up with a database (not necessarily RDBMS) or
    you'll split your data into several manageable pieces, to be processed
    and loaded sequentially on one server or in parallel on multiple
    servers.

    For most problems, using an RDBMS is the fastest, cheapest, simplest
    way to manage large amounts of data. You see, then you can
    just blame the DBAs when things don't work right :)

    Ted
    Ted Zlatanov, Nov 17, 2006
    #6
  7. On 11/17/2006 08:38 AM, Chris wrote:
    > I've come across the perl issue of inefficient use of memory when
    > dealing with large datasets. What are people's opinions on the best way
    > to work around this problem.
    >
    > e.g.
    >
    > My input file has this layout:
    > # Input 1_8:
    > 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    > # Output 1_8:
    > 0 0 1
    > # Input 1_9:
    > 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    > # Output 1_9:
    > 0 0 1
    > # Input 1_10:
    > 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    > # Output 1_10:
    > 0 0 1
    >
    > With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    > However when reading the file into an array with the following code
    > snippet results in 1.2Gb of memory usage:
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > my ($patfile) = @ARGV;
    >
    > open(my $FH, $patfile) or die;
    > my @array;
    > my $flag = 0;
    > my $i = 0;
    >
    > while (<$FH>) {
    > $flag = 0 if (/^\# Output/);
    > $flag = 1 and next if (/^\# Input/);
    > if ($flag) {
    > chomp;
    > print "$i\n";
    > $array[$i] = [ split ];
    > ++$i;
    > }
    > }
    > exit;
    >
    > I've read about the various work-arounds to access the array via a file
    > on disk, but they don't seem to be very conducive for working with
    > complex data structures. Can you guys/gals let me know of their
    > favourite method to work more efficiently as at the moment I'm just
    > reading/writing the files a bit at a time?
    > TIA


    Arrays have a lot of overhead, so don't split the lines into arrays;
    just put them into the main array without splitting.

    When you need the data from a line, split it then.
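
    A minimal sketch of that idea against the original snippet: keep each
    data line as one string (one scalar per row instead of a couple of
    hundred) and split only the row that is actually being worked on. For
    simplicity this keeps both the input and output rows; the original flag
    logic could be reused to keep only the inputs.

    use strict;
    use warnings;

    my ($patfile) = @ARGV;
    open(my $FH, '<', $patfile) or die "Cannot open '$patfile': $!";

    my @rows;
    while (my $line = <$FH>) {
        next if $line =~ /^\#/;    # skip the "# Input ..." / "# Output ..." headers
        chomp $line;
        push @rows, $line;         # whole line as a single scalar
    }

    # Split a row only at the moment its values are needed.
    my @values = split ' ', $rows[0];
    print scalar(@values), " values in the first row\n";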


    --
    Mumia W. (reading news), Nov 17, 2006
    #7
  8. On 2006-11-17 14:38, Chris <> wrote:
    > I've come across the perl issue of inefficient use of memory when
    > dealing with large datasets.


    You aren't the first one. There are modules for dealing with large
    numeric arrays for a reason.

    > What are people's opinions on the best way
    > to work around this problem.


    So far I haven't needed them, but searching CPAN for appropriate modules
    would certainly be among the first things I'd try. I have also
    bookmarked something called "PDL - The Perl Data Language" just in case
    I ever need it.
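
    For a rough idea of what that could look like here, a minimal sketch
    assuming PDL is installed: each row of input values goes into one row of
    a dense double-precision piddle, at about 8 bytes per value instead of a
    full Perl scalar each. The file name and the dimensions are assumptions
    for illustration and would need to match the real data.

    use strict;
    use warnings;
    use PDL;

    my ($patfile, $ncols, $nrows) = ('patterns.pat', 200, 73000);  # illustrative
    open(my $FH, '<', $patfile) or die "Cannot open '$patfile': $!";

    # Pre-allocate one flat block of doubles and fill it row by row.
    my $inputs = zeroes(double, $ncols, $nrows);
    my ($row, $want) = (0, 0);
    while (my $line = <$FH>) {
        if    ($line =~ /^\# Input/)  { $want = 1; next; }
        elsif ($line =~ /^\# Output/) { $want = 0; next; }
        next unless $want;
        $inputs->slice(":,($row)") .= pdl(split ' ', $line);
        $row++;
    }
    print "stored ", $inputs->info, "\n";    # e.g. "PDL: Double D [200,73000]"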

    > My input file has this layout:
    > # Input 1_8:
    > 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    > # Output 1_8:
    > 0 0 1
    > # Input 1_9:
    > 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    > # Output 1_9:
    > 0 0 1
    > # Input 1_10:
    > 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    > # Output 1_10:
    > 0 0 1
    >
    > With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    > However when reading the file into an array with the following code
    > snippet results in 1.2Gb of memory usage:


    This is not surprising. Perl scalars take quite a bit of space. Assuming
    no overhead from memory management (which is hardly realistic), a floating
    point number takes 20 bytes, and a string takes 25 + n bytes (where n is
    the length of the string).

    > $array[$i] = [ split ];


    You are storing your values as strings here. Since all your values seem
    to be 7 characters long, you could reduce the size of each element from
    32 to 20 bytes, saving almost 40%, by converting each value into a
    number:

    $array[$i] = [ map { $_ + 0 } split ];

    In reality, the space saving may be less or more, depending on the
    memory management of your perl implementation, the exact shape of your
    data and other conditions.

    Note that this solution is brittle: If you access the elements of your
    arrays in a string context, perl may convert them back into strings, and
    you will need even more space than you needed in the first place.
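
    One way to see what a given layout actually costs, rather than guessing,
    is Devel::Size from CPAN (assumed installed; it is not mentioned
    elsewhere in the thread): its total_size() reports the bytes a structure
    occupies, so the string and numeric variants can be compared on a sample
    row before committing to either.

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    my $line = "0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535";

    my @as_strings = ( [ split ' ', $line ] );                 # values stay strings
    my @as_numbers = ( [ map { $_ + 0 } split ' ', $line ] );  # values forced numeric

    printf "strings: %d bytes, numbers: %d bytes\n",
        total_size(\@as_strings), total_size(\@as_numbers);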

    hp


    Peter J. Holzer, Nov 18, 2006
    #8
  9. Chris

    Chris Guest

    Peter J. Holzer wrote:

    > On 2006-11-17 14:38, Chris <> wrote:
    >> I've come across the perl issue of inefficient use of memory when
    >> dealing with large datasets.

    >
    > You aren't the first one. There are modules for dealing with large
    > numeric arrays for a reason.
    >
    > So far I haven't needed them but searching CPAN for appropriate
    > modules would certainly be among the first things I'd try. I have also
    > bookmarked something called "PDL - The Perl Data Language" just in
    > case I'll ever need it.


    Yes, I've seen that one; it looks very useful indeed. I'm sure I'll use
    it in the future.

    >> With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    >> However when reading the file into an array with the following code
    >> snippet results in 1.2Gb of memory usage:

    >
    > You are storing your values as strings here. Since all your values
    > seem to be 7 characters long you could reduce the size of each element
    > from 32 to 20 bytes, saving almost 40 %, by converting each value into
    > a number:
    >
    > $array[$i] = [ map { $_ + 0 } split ];
    >
    > In reality, the space saving may be less or more, depending on the
    > memory management of your perl implementation, the exact shape of your
    > data and other conditions.


    Indeed, the above makes almost no difference (~100MB) to my example
    code... :(
    Chris, Nov 20, 2006
    #9
  10. Chris

    Chris Guest

    Thanks for all the useful replies. I now have better ideas for future
    memory management.
    Chris, Nov 20, 2006
    #10