Strange behavior when working with large files

Discussion in 'Perl Misc' started by bjamin, Jul 1, 2005.

  1. bjamin

    bjamin Guest

    I have been working on a strange problem. I am reading a series of
    large files (50 MB or so) in, one at a time, with:

    @lines = <FILE>;

    or (same behavior with either):

    while (<FILE>) {
        push(@lines, $_);
    }

    The first time I read a file, it is read into the array in about 2
    seconds. The second time I read a file of the same size, it takes
    about 20 seconds. Everything is declared locally inside the loop, so
    everything leaves scope each iteration. I am not sure why it is taking
    so much longer the second time.

    I have narrowed the problem down to a few different areas:

    1. It seems that if I read the file into a large scalar by setting
    $/ = undef, the file gets read faster. So I assume the slowdown is
    taking place inside the splitting of the lines (sketched below).

    2. If I try to append to one large array, rather than rewriting a
    different array, the slowdown does not occur. So it seems Perl has a
    hard time with the memory it already has, but it's fine with memory it
    just took from the system?

    3. The problem does not seem to happen on Linux, but I'm working on
    Windows.
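
    To be concrete about point 1, this is roughly what I mean (a minimal
    sketch; the file name is made up, and the final split is just one way
    to get lines back out of the scalar):

    my $whole = do {
        open my $fh, '<', 'big.txt' or die "open: $!";
        local $/;                        # undef $/ => slurp mode
        <$fh>;                           # whole file read in one go (fast)
    };
    my @lines = split /^/m, $whole;      # split into lines, newlines kept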

    Any suggestions for a workaround? Has anyone else seen this? Thanks in
    advance.


    Ben
     
    bjamin, Jul 1, 2005
    #1

  2. A. Sinan Unur

    Why do you think you need to do that?

    perldoc -q memory
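
    In other words: if you can, process each line as you read it instead
    of storing them all. A rough sketch (the file name and the filter are
    made up):

    open my $fh, '<', 'big.txt' or die "open: $!";
    while ( my $line = <$fh> ) {
        next unless $line =~ /ERROR/;    # made-up filter condition
        print $line;                     # handle the line right away
    }
    close $fh;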

    Sinan
     
    A. Sinan Unur, Jul 1, 2005
    #2

  3. bjamin

    bjamin Guest

    I need it to be in an array because I am deleting some lines and
    re-ordering others, so I can't work on anything unless I have the
    whole file in memory to do comparisons.
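
    Roughly the kind of thing I'm doing (just a sketch; the real delete
    and re-order criteria are more involved):

    my @lines = <FILE>;
    @lines = grep { !/^\s*$/ } @lines;   # delete some lines (placeholder)
    @lines = reverse @lines;             # re-order others (placeholder)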

    Ben
     
    bjamin, Jul 1, 2005
    #3
  4. Fabian Pilkowski

    Seems so, on my system I get similar results. If you can narrow your
    problem down to a few lines of code, feel free to post that small
    program; it makes it easier to reproduce. Just for testing, I've
    written such a small script myself.


    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Benchmark;

    my $file = '50mb.txt';
    for ( 1 .. 4 ) {
        print timestr( timeit( 1, sub {
            # local $/ = undef;
            open my $fh, '<', $file or die $!;
            # my @lines = <$fh>;
            my @lines; push @lines, $_ while <$fh>;
        } ) ), "\n";
    }
    __END__


    The file I'm reading here consists of 1.5 million lines (50 MB
    altogether). I get:

    4 wallclock secs ( 3.98 usr + 0.14 sys = 4.13 CPU) @ 0.24/s (n=1)
    34 wallclock secs (33.89 usr + 0.16 sys = 34.05 CPU) @ 0.03/s (n=1)
    27 wallclock secs (26.17 usr + 0.13 sys = 26.30 CPU) @ 0.04/s (n=1)
    28 wallclock secs (27.77 usr + 0.20 sys = 27.97 CPU) @ 0.04/s (n=1)

    With localizing of $/ enabled (slurp mode), I get:

    1 wallclock secs ( 0.77 usr + 0.09 sys = 0.86 CPU) @ 1.16/s (n=1)
    0 wallclock secs ( 0.72 usr + 0.17 sys = 0.89 CPU) @ 1.12/s (n=1)
    0 wallclock secs ( 0.72 usr + 0.23 sys = 0.95 CPU) @ 1.05/s (n=1)
    1 wallclock secs ( 0.70 usr + 0.23 sys = 0.94 CPU) @ 1.07/s (n=1)

    With "my @lines = <$fh>" instead of the while loop, I get:

    22 wallclock secs (16.13 usr + 5.22 sys = 21.34 CPU) @ 0.05/s (n=1)
    36 wallclock secs (35.38 usr + 0.22 sys = 35.59 CPU) @ 0.03/s (n=1)
    6 wallclock secs ( 5.58 usr + 0.14 sys = 5.72 CPU) @ 0.17/s (n=1)
    37 wallclock secs (36.88 usr + 0.17 sys = 37.05 CPU) @ 0.03/s (n=1)

    Curious, I don't know why the third attempt is breaking ranks.

    I have run my script with another input file, too, one with
    considerably fewer newlines (also 50 MB, approx. 200,000 lines). I get
    the following result for the loop:

    1 wallclock secs ( 1.34 usr + 0.14 sys = 1.48 CPU) @ 0.67/s (n=1)
    12 wallclock secs (11.45 usr + 0.19 sys = 11.64 CPU) @ 0.09/s (n=1)
    15 wallclock secs (14.48 usr + 0.19 sys = 14.67 CPU) @ 0.07/s (n=1)
    10 wallclock secs (10.45 usr + 0.22 sys = 10.67 CPU) @ 0.09/s (n=1)

    And for the version with "my @lines = <$fh>":

    3 wallclock secs ( 3.06 usr + 0.33 sys = 3.39 CPU) @ 0.29/s (n=1)
    57 wallclock secs (55.86 usr + 0.31 sys = 56.17 CPU) @ 0.02/s (n=1)
    60 wallclock secs (59.20 usr + 0.23 sys = 59.44 CPU) @ 0.02/s (n=1)
    58 wallclock secs (57.39 usr + 0.22 sys = 57.61 CPU) @ 0.02/s (n=1)

    It seems that Perl needs more time the longer the lines are. To check
    this, I ran the script with a 50 MB file containing only one newline,
    right in the middle; there all four attempts need (nearly) the same
    time:

    269 wallclock secs (185.00 usr + 81.86 sys = 266.86 CPU) @ 0.00/s (n=1)
    277 wallclock secs (184.42 usr + 87.11 sys = 271.53 CPU) @ 0.00/s (n=1)
    276 wallclock secs (183.98 usr + 86.03 sys = 270.02 CPU) @ 0.00/s (n=1)
    272 wallclock secs (184.74 usr + 85.03 sys = 269.77 CPU) @ 0.00/s (n=1)
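
    For anyone who wants to try this at home, a throwaway generator for
    such a test file could look like this (a sketch; file name, line
    length and line count are just values matching my first file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    # ~35 bytes per line * 1.5 million lines is roughly 50 MB
    open my $out, '>', '50mb.txt' or die $!;
    print {$out} ( 'x' x 34 ) . "\n" for 1 .. 1_500_000;
    close $out or die $!;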

    Right (regarding your point 2). In my example, if I move the
    declaration "my @lines" in front of the for-loop (nothing else in the
    script changes), I get this for the first file with 1.5 million lines:

    4 wallclock secs ( 3.02 usr + 0.25 sys = 3.27 CPU) @ 0.31/s (n=1)
    3 wallclock secs ( 2.95 usr + 0.31 sys = 3.27 CPU) @ 0.31/s (n=1)
    7 wallclock secs ( 2.86 usr + 0.27 sys = 3.13 CPU) @ 0.32/s (n=1)
    9 wallclock secs ( 3.11 usr + 0.34 sys = 3.45 CPU) @ 0.29/s (n=1)
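
    For reference, the changed part of the script looks like this (only
    the declaration moves; everything else stays as above):

    my @lines;                               # now declared once, up front
    for ( 1 .. 4 ) {
        print timestr( timeit( 1, sub {
            open my $fh, '<', $file or die $!;
            push @lines, $_ while <$fh>;     # keeps appending across runs
        } ) ), "\n";
    }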

    Actually this creates an array with 6 million elements in the end. The
    performance penalty in the second half is just because my machine has
    only 512 MB RAM and needs to swap. Hence the results for the file with
    only 200,000 lines look much better (no swapping is needed):

    1 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
    2 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
    1 wallclock secs ( 1.03 usr + 0.28 sys = 1.31 CPU) @ 0.76/s (n=1)
    1 wallclock secs ( 1.09 usr + 0.23 sys = 1.33 CPU) @ 0.75/s (n=1)

    I have run this on Windows XP SP2 with ActiveState's Perl 5.8.6.
    I have no suggestions for a workaround ;-(

    Yes, I have seen it now ;-)

    But: is it really necessary to read in the whole file? Would you have
    to compare the first line with the last in the worst case? Perhaps you
    could give your algorithm a second thought.

    regards,
    fabian
     
    Fabian Pilkowski, Jul 2, 2005
    #4
  5. Fabian Pilkowski

    [reading a large file into an array if memory is already allocated]

    Upgrading to ActiveState's current version 5.8.7 does not solve this
    problem. Without changing anything but the Perl version, I get:

    6 wallclock secs ( 4.50 usr + 0.22 sys = 4.72 CPU) @ 0.21/s (n=1)
    69 wallclock secs (68.27 usr + 0.16 sys = 68.42 CPU) @ 0.01/s (n=1)
    68 wallclock secs (67.30 usr + 0.31 sys = 67.61 CPU) @ 0.01/s (n=1)
    68 wallclock secs (67.30 usr + 0.20 sys = 67.50 CPU) @ 0.01/s (n=1)

    And the version with "my @lines = <$fh>" turns into:

    21 wallclock secs (17.45 usr + 4.19 sys = 21.64 CPU) @ 0.05/s (n=1)
    255 wallclock secs (252.19 usr + 0.33 sys = 252.51 CPU) @ 0.00/s (n=1)
    264 wallclock secs (255.63 usr + 0.38 sys = 256.00 CPU) @ 0.00/s (n=1)
    261 wallclock secs (254.33 usr + 0.52 sys = 254.84 CPU) @ 0.00/s (n=1)

    It seems that someone wants to prevent you from reading large files
    into an array. But perhaps this slowdown affects other Perl operations
    too. Up to now I thought things would go faster if memory were already
    allocated; it looks as though Perl is not simply re-using it, but is
    doing something else first.

    As mentioned, I upgraded to ActiveState's Perl 5.8.7 just now. Does
    anyone know (or have any idea) what Perl is doing when it re-uses
    already allocated memory on Windows systems?

    Or is Windows itself the cause of this behavior? Could anyone reproduce
    this problem with another Perl distribution under Windows?
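
    One thing that might help to pin it down (untested guesswork on my
    part): time each read in a brand-new perl process, so that no memory
    can be re-used between runs. Something like:

    use strict;
    use warnings;
    use Benchmark;
    my $file = '50mb.txt';
    for ( 1 .. 4 ) {
        print timestr( timeit( 1, sub {
            # start a fresh interpreter each time, so nothing is re-used
            system( $^X, '-e',
                'open my $fh, q{<}, shift or die $!; my @lines = <$fh>;',
                $file );
        } ) ), "\n";
    }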

    regards,
    fabian
     
    Fabian Pilkowski, Jul 2, 2005
    #5
