Best practice for avoiding excessive memory usage?

Discussion in 'Perl Misc' started by Chris, Nov 17, 2006.

  1. Chris

    Chris Guest

    I've come across the Perl issue of inefficient use of memory when
    dealing with large datasets. What are people's opinions on the best way
    to work around this problem?

    e.g.

    My input file has this layout:
    # Input 1_8:
    0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    # Output 1_8:
    0 0 1
    # Input 1_9:
    0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    # Output 1_9:
    0 0 1
    # Input 1_10:
    0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    # Output 1_10:
    0 0 1

    There are ~73000 pairs of inputs and outputs, and the file is ~260MB in
    size. However, reading the file into an array with the following code
    snippet results in 1.2GB of memory usage:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my ($patfile) = @ARGV;

    open(my $FH, $patfile) or die;
    my @array;
    my $flag = 0;
    my $i = 0;

    while (<$FH>) {
        $flag = 0 if (/^\# Output/);
        $flag = 1 and next if (/^\# Input/);
        if ($flag) {
            chomp;
            print "$i\n";
            $array[$i] = [ split ];
            ++$i;
        }
    }
    exit;

    I've read about the various workarounds that access the array via a file
    on disk, but they don't seem to be very conducive to working with
    complex data structures. Can you guys/gals let me know your favourite
    methods for working more efficiently? At the moment I'm just
    reading/writing the files a bit at a time.
    TIA
    Chris, Nov 17, 2006
    #1

  2. Chris

    Guest

    Chris <> wrote:
    > I've come across the perl issue of inefficient use of memory when
    > dealing with large datasets. What are people's opinions on the best way
    > to work around this problem.


    That depends entirely on what you are trying to do with the data. You
    haven't shown us anything about what you are trying to do. The code you
    showed us does nothing but take memory and burn CPU cycles.

    > e.g.
    >
    > My input file has this layout:
    > # Input 1_8:
    > 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    > # Output 1_8:
    > 0 0 1
    > # Input 1_9:
    > 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    > # Output 1_9:
    > 0 0 1
    > # Input 1_10:
    > 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    > # Output 1_10:
    > 0 0 1
    >
    > With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    > However when reading the file into an array with the following code
    > snippet results in 1.2Gb of memory usage:
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > my ($patfile) = @ARGV;
    >
    > open(my $FH, $patfile) or die;
    > my @array;
    > my $flag = 0;
    > my $i = 0;
    >
    > while (<$FH>) {
    > $flag = 0 if (/^\# Output/);
    > $flag = 1 and next if (/^\# Input/);
    > if ($flag) {
    > chomp;
    > print "$i\n";
    > $array[$i] = [ split ];
    > ++$i;
    > }
    > }
    > exit;


    This program reads in data and does nothing with it. You may as well
    move the "exit" up to just before the "use strict;".

    >
    > I've read about the various work-arounds to access the array via a file
    > on disk,


    Which ones?

    > but they don't seem to be very conducive for working with
    > complex data structures.


    Why not? What problems did you encounter?

    > Can you guys/gals let me know of their
    > favourite method to work more efficiently as at the moment I'm just
    > reading/writing the files a bit at a time?


    Reading and writing the files a bit at a time is an efficient method.
    At least as far as memory is concerned.
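
    For illustration, here is a minimal sketch of that streaming approach for
    this file format: read one input/output pair at a time, hand it to a
    processing routine, and forget it before reading the next one. The
    process_pair() routine is a hypothetical placeholder for whatever the
    real program needs to do with each pair.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($patfile) = @ARGV;
    open(my $FH, '<', $patfile) or die "Cannot open '$patfile': $!";

    my $state = '';
    my (@input, @output);
    while (my $line = <$FH>) {
        chomp $line;
        if ($line =~ /^\# Input/) {
            # A new pair begins: process the previous one, then discard it.
            process_pair(\@input, \@output) if @input;
            @input  = ();
            @output = ();
            $state  = 'input';
        }
        elsif ($line =~ /^\# Output/) {
            $state = 'output';
        }
        elsif ($state eq 'input') {
            @input = split ' ', $line;
        }
        elsif ($state eq 'output') {
            @output = split ' ', $line;
        }
    }
    process_pair(\@input, \@output) if @input;    # the last pair in the file

    sub process_pair {
        my ($in, $out) = @_;
        # Placeholder: only the current pair is ever held in memory.
        printf "pair with %d inputs and %d outputs\n", scalar @$in, scalar @$out;
    }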

    Xho

    xhoster, Nov 17, 2006
    #2

  3. Chris

    Chris Guest

    xhoster wrote:

    > Chris <> wrote:
    >> I've come across the perl issue of inefficient use of memory when
    >> dealing with large datasets. What are people's opinions on the best
    >> way to work around this problem.

    >
    > That depends entirely on what you are trying to do with the data. You
    > haven't shown us anything about what you are trying to do. The code
    > you showed us does nothing but take memory and burn CPU cycles.


    Exactly. I was trying to give an example of the inefficient use of
    memory by Perl - nothing more and nothing less.
    [snip]

    >>
    >> I've read about the various work-arounds to access the array via a
    >> file on disk,

    >
    > Which ones?


    The ones in the FAQ: 'How can I make my Perl program take less memory?'

    >> but they don't seem to be very conducive for working with
    >> complex data structures.

    >
    > Why not? What problems did you encounter?


    AFAICS you can either store 1D arrays as lines in a file or use some
    sort of DB to manage the data. I may use these in the future, but at
    the moment I'm looking for a reasonably straightforward method to make
    an existing program more memory-efficient.

    >
    >> Can you guys/gals let me know of their
    >> favourite method to work more efficiently as at the moment I'm just
    >> reading/writing the files a bit at a time?

    >
    > Reading and writing the files a bit at a time is an efficient method.
    > At least as far as memory is concerned.
    >


    OK. That's what I'll do for the time being. However, I'm still
    interested in hearing how other people have overcome this problem.
    Thanks.
    Chris, Nov 17, 2006
    #3
  4. On Fri, 17 Nov 2006 15:43:48 +0000, xhoster wrote:

    >> Can you guys/gals let me know of their
    >> favourite method to work more efficiently as at the moment I'm just
    >> reading/writing the files a bit at a time?

    >
    > Reading and writing the files a bit at a time is an efficient method.
    > At least as far as memory is concerned.


    That is the best method.

    Others include:

    - Add more memory. 1.2GB of data usage is not that much, and memory is
    cheap.

    - Process the file in stages, producing intermediary results (and files)
    to make the next stage efficient.

    - Put the data in a database, optionally producing a new datafile from
    the database after processing (see the sketch below).
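
    For the database route, a minimal sketch using DBI with DBD::SQLite
    (assumed to be installed); the database file, table and column names are
    made up for illustration. Each pair is stored as its raw text lines, with
    only one pair held in memory at a time, and rows are split on demand when
    read back.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=patterns.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do("CREATE TABLE IF NOT EXISTS pairs
              (id INTEGER PRIMARY KEY, input TEXT, output TEXT)");
    my $ins = $dbh->prepare("INSERT INTO pairs (id, input, output) VALUES (?, ?, ?)");

    my ($id, $section, $input_line) = (0, '', '');
    while (my $line = <STDIN>) {               # e.g. pipe the 260MB file in
        chomp $line;
        if    ($line =~ /^\# Input/)  { $section = 'input'; $id++; }
        elsif ($line =~ /^\# Output/) { $section = 'output';       }
        elsif ($section eq 'input')   { $input_line = $line;       }
        elsif ($section eq 'output')  { $ins->execute($id, $input_line, $line); }
    }
    $dbh->commit;

    # Later: fetch one pair at a time and split only when needed.
    my ($in) = $dbh->selectrow_array("SELECT input FROM pairs WHERE id = ?", undef, 1);
    my @values = defined $in ? split(' ', $in) : ();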

    M4
    --
    Redundancy is a great way to introduce more single points of failure.
    Martijn Lievaart, Nov 17, 2006
    #4
  5. Chris

    Guest

    Chris <> wrote:
    > >
    > >> Can you guys/gals let me know of their
    > >> favourite method to work more efficiently as at the moment I'm just
    > >> reading/writing the files a bit at a time?

    > >
    > > Reading and writing the files a bit at a time is an efficient method.
    > > At least as far as memory is concerned.
    > >

    >
    > OK. That's what I'll do for the time being. However, I'm still
    > interested in hearing how other people have overcome this problem.


    I've used probably dozens of different methods to overcome the problem of
    excess memory use, but each one is suited to only specific kinds of
    problems. Changing algorithms so that you don't hold everything in memory
    at once. Using Perl to transform the problem into something that can be
    solved by the system sort routine. Changing languages to something more
    memory-efficient, either entirely, or using Inline, or just by using Perl
    to pre-process into a C-friendly format, then using C, then using Perl to
    post-process back into the desired format. Using DBM::Deep. Storing
    "records" as whole strings and splitting them on the fly when needed
    (occasionally using tied arrays or hashes to hide this fact).
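
    Of those, DBM::Deep is perhaps the most drop-in: it keeps nested Perl
    structures in a file and only pages in the pieces that are actually
    touched. A minimal sketch, assuming DBM::Deep is installed (the file name
    and structure here are illustrative, not from the original program):

    use strict;
    use warnings;
    use DBM::Deep;

    # All of the data lives in "patterns.db" on disk.
    my $db = DBM::Deep->new("patterns.db");
    $db->{pairs} ||= [];

    # Store a pair; nested references are written through to the file.
    push @{ $db->{pairs} }, {
        input  => [ 0.28496, 0.10340, 0.33403 ],   # values from the sample data
        output => [ 0, 0, 1 ],
    };

    # Read a single element back later without loading the whole array.
    my $first = $db->{pairs}[0];
    print "first output vector: @{ $first->{output} }\n";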

    Xho

    xhoster, Nov 17, 2006
    #5
  6. Chris

    Ted Zlatanov Guest

    On 17 Nov 2006, Chris wrote:

    > OK. That's what I'll do for the time being. However, I'm still
    > interested in hearing how other people have overcome this problem.


    As the size of your data grows, the solutions grow more complex too.
    Everyone knows how to manage data that is 1% of the system memory well.
    Few manage data that is 500% of the system memory well.

    Depending on your application you'll have to find the right solution.
    Usually you'll end up with a database (not necessarily RDBMS) or
    you'll split your data into several manageable pieces, to be processed
    and loaded sequentially on one server or in parallel on multiple
    servers.

    For most problems, using an RDBMS is the fastest, cheapest, simplest
    way to manage large amounts of data. You see, then you can
    just blame the DBAs when things don't work right :)

    Ted
    Ted Zlatanov, Nov 17, 2006
    #6
  7. On 11/17/2006 08:38 AM, Chris wrote:
    > I've come across the perl issue of inefficient use of memory when
    > dealing with large datasets. What are people's opinions on the best way
    > to work around this problem.
    >
    > e.g.
    >
    > My input file has this layout:
    > # Input 1_8:
    > 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    > # Output 1_8:
    > 0 0 1
    > # Input 1_9:
    > 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    > # Output 1_9:
    > 0 0 1
    > # Input 1_10:
    > 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    > # Output 1_10:
    > 0 0 1
    >
    > With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    > However when reading the file into an array with the following code
    > snippet results in 1.2Gb of memory usage:
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > my ($patfile) = @ARGV;
    >
    > open(my $FH, $patfile) or die;
    > my @array;
    > my $flag = 0;
    > my $i = 0;
    >
    > while (<$FH>) {
    > $flag = 0 if (/^\# Output/);
    > $flag = 1 and next if (/^\# Input/);
    > if ($flag) {
    > chomp;
    > print "$i\n";
    > $array[$i] = [ split ];
    > ++$i;
    > }
    > }
    > exit;
    >
    > I've read about the various work-arounds to access the array via a file
    > on disk, but they don't seem to be very conducive for working with
    > complex data structures. Can you guys/gals let me know of their
    > favourite method to work more efficiently as at the moment I'm just
    > reading/writing the files a bit at a time?
    > TIA


    Arrays have a lot of overhead, so don't split the lines into arrays;
    just put them into the main array without splitting.

    When you need the data from a line, split it then.
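
    A minimal sketch of that idea against the original snippet: keep each
    data line as one string (one scalar per row instead of a couple of
    hundred) and split only the row that is actually being worked on. For
    simplicity this keeps both the input and output rows; the original flag
    logic could be reused to keep only the inputs.

    use strict;
    use warnings;

    my ($patfile) = @ARGV;
    open(my $FH, '<', $patfile) or die "Cannot open '$patfile': $!";

    my @rows;
    while (my $line = <$FH>) {
        next if $line =~ /^\#/;    # skip the "# Input ..." / "# Output ..." headers
        chomp $line;
        push @rows, $line;         # whole line as a single scalar
    }

    # Split a row only at the moment its values are needed.
    my @values = split ' ', $rows[0];
    print scalar(@values), " values in the first row\n";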


    --
    Mumia W. (reading news), Nov 17, 2006
    #7
  8. On 2006-11-17 14:38, Chris <> wrote:
    > I've come across the perl issue of inefficient use of memory when
    > dealing with large datasets.


    You aren't the first one. There are modules for dealing with large
    numeric arrays for a reason.

    > What are people's opinions on the best way
    > to work around this problem.


    So far I haven't needed them, but searching CPAN for appropriate modules
    would certainly be among the first things I'd try. I have also
    bookmarked something called "PDL - The Perl Data Language" just in case
    I ever need it.
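
    For a rough idea of what that could look like here, a minimal sketch
    assuming PDL is installed: each row of input values goes into one row of
    a dense double-precision piddle, at about 8 bytes per value instead of a
    full Perl scalar each. The file name and the dimensions are assumptions
    for illustration and would need to match the real data.

    use strict;
    use warnings;
    use PDL;

    my ($patfile, $ncols, $nrows) = ('patterns.pat', 200, 73000);  # illustrative
    open(my $FH, '<', $patfile) or die "Cannot open '$patfile': $!";

    # Pre-allocate one flat block of doubles and fill it row by row.
    my $inputs = zeroes(double, $ncols, $nrows);
    my ($row, $want) = (0, 0);
    while (my $line = <$FH>) {
        if    ($line =~ /^\# Input/)  { $want = 1; next; }
        elsif ($line =~ /^\# Output/) { $want = 0; next; }
        next unless $want;
        $inputs->slice(":,($row)") .= pdl(split ' ', $line);
        $row++;
    }
    print "stored ", $inputs->info, "\n";    # e.g. "PDL: Double D [200,73000]"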

    > My input file has this layout:
    > # Input 1_8:
    > 0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535 ...
    > # Output 1_8:
    > 0 0 1
    > # Input 1_9:
    > 0.38225 0.98944 0.03805 0.04031 0.05417 0.19623 0.07656 0.07944 ...
    > # Output 1_9:
    > 0 0 1
    > # Input 1_10:
    > 0.11106 0.02792 0.69635 0.37519 0.01326 0.95435 0.15976 0.01406 ...
    > # Output 1_10:
    > 0 0 1
    >
    > With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    > However when reading the file into an array with the following code
    > snippet results in 1.2Gb of memory usage:


    This is not surprising. Perl scalars take quite a bit of space. Assuming
    no overhead from memory management (which is hardly realistic), a floating
    point number takes 20 bytes, and a string takes 25 + n bytes (where n is
    the length of the string).

    > $array[$i] = [ split ];


    You are storing your values as strings here. Since all your values seem
    to be 7 characters long, you could reduce the size of each element from
    32 to 20 bytes, saving almost 40%, by converting each value into a
    number:

    $array[$i] = [ map { $_ + 0 } split ];

    In reality, the space saving may be less or more, depending on the
    memory management of your perl implementation, the exact shape of your
    data and other conditions.

    Note that this solution is brittle: If you access the elements of your
    arrays in a string context, perl may convert them back into strings, and
    you will need even more space than you needed in the first place.
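
    One way to see what a given layout actually costs, rather than guessing,
    is Devel::Size from CPAN (assumed installed; it is not mentioned
    elsewhere in the thread): its total_size() reports the bytes a structure
    occupies, so the string and numeric variants can be compared on a sample
    row before committing to either.

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    my $line = "0.28496 0.10340 0.33403 0.86176 0.06723 0.15316 0.46009 0.09535";

    my @as_strings = ( [ split ' ', $line ] );                 # values stay strings
    my @as_numbers = ( [ map { $_ + 0 } split ' ', $line ] );  # values forced numeric

    printf "strings: %d bytes, numbers: %d bytes\n",
        total_size(\@as_strings), total_size(\@as_numbers);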

    hp


    Peter J. Holzer, Nov 18, 2006
    #8
  9. Chris

    Chris Guest

    Peter J. Holzer wrote:

    > On 2006-11-17 14:38, Chris <> wrote:
    >> I've come across the perl issue of inefficient use of memory when
    >> dealing with large datasets.

    >
    > You aren't the first one. There are modules for dealing with large
    > numeric arrays for a reason.
    >
    > So far I haven't needed them but searching CPAN for appropriate
    > modules would certainly be among the first things I'd try. I have also
    > bookmarked something called "PDL - The Perl Data Language" just in
    > case I'll ever need it.


    Yes, I've seen that one; it looks very useful indeed. I'm sure I'll use
    it in the future.

    >> With ~73000 pairs of input and outputs. The file is ~260Mb in size.
    >> However when reading the file into an array with the following code
    >> snippet results in 1.2Gb of memory usage:

    >
    > You are storing your values as strings here. Since all your values
    > seem to be 7 characters long you could reduce the size of each element
    > from 32 to 20 bytes, saving almost 40 %, by converting each value into
    > a number:
    >
    > $array[$i] = [ map { $_ + 0 } split ];
    >
    > In reality, the space saving may be less or more, depending on the
    > memory management of your perl implementation, the exact shape of your
    > data and other conditions.


    Indeed, the above makes almost no difference (~100MB) to my example
    code... :(
    Chris, Nov 20, 2006
    #9
  10. Chris

    Chris Guest

    Thanks for all the useful replies. I now have better ideas for future
    memory management.
    Chris, Nov 20, 2006
    #10