Strange behavior when working with large files

bjamin

I have been working on a strange problem. I am reading a series of
large files (50 MB or so) in, one at a time, with:

@lines = <FILE>;

or (same behavior with each):

while (<FILE>) {
    push(@lines, $_);
}

The first time I read a file, it is read into the array in about 2
seconds. The second time I read a file of the same size, it takes about
20 seconds. Everything is declared locally inside the loop, so
everything goes out of scope. I am not sure why it is taking so much
longer the second time.

I have narrowed the problem down to a few different areas:

1. It seems that if I read the file into a large scalar by setting
$/ = undef, the file gets read faster. So I assume the slowdown is
happening in the splitting of the lines (see the short sketch after
this list).

2. If I append to one large array, rather than rewriting a different
array each time, the slowdown does not occur. So it seems Perl has a
hard time with memory it already holds, but is fine with memory it
just took from the system?

3. The problem does not seem to happen on Linux, but I'm working on
Windows.
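To be concrete, the slurp variant from point 1 looks roughly like this
(a minimal sketch; the filename is just a placeholder):

open(FILE, '<', 'bigfile.txt') or die "open failed: $!";
my $text = do { local $/; <FILE> };   # $/ undefined: the whole file ends up in one scalar
close(FILE);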

Any suggestions for a workaround? Has anyone else seen this? Thanks in
advance.


Ben
 
A. Sinan Unur

(e-mail address removed) wrote:
I have been working on a strange problem. I am reading a series of
large files (50 MB or so) in, one at a time, with:

@lines = <FILE>;

or (same behavior with each):

while (<FILE>) {
    push(@lines, $_);
}

Why do you think you need to do that?
2. If I append to one large array, rather than rewriting a different
array each time, the slowdown does not occur. So it seems Perl has a
hard time with memory it already holds, but is fine with memory it
just took from the system?

perldoc -q memory
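In many cases you do not need the whole file in memory at all; the
usual approach is to handle each line as you read it, e.g. (a sketch,
with $file standing in for your filename):

open my $fh, '<', $file or die "Cannot open $file: $!";
while ( my $line = <$fh> ) {
    # process $line here, one line at a time
}
close $fh or die "Cannot close $file: $!";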

Sinan
 
bjamin

I need it to be in an array because I am deleting lines and
re-ordering some lines, so I can't work on anything unless I have the
whole thing in memory to do the comparisons.
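Very roughly, and with placeholder conditions only, the kind of thing I
do afterwards is:

# Placeholder example only: drop some lines, then re-order the rest.
@lines = grep { $_ !~ /^\s*$/ } @lines;   # e.g. delete blank lines
@lines = sort @lines;                     # e.g. re-order the remainder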

Ben
 
Fabian Pilkowski

I have been working on a strange problem. I am reading a series of
large files (50 MB or so) in, one at a time, with:

@lines = <FILE>;

or (same behavior with each):

while (<FILE>) {
    push(@lines, $_);
}

The first time I read a file, it is read into the array in about 2
seconds. The second time I read a file of the same size, it takes about
20 seconds. Everything is declared locally inside the loop, so
everything goes out of scope. I am not sure why it is taking so much
longer the second time.

I have narrowed the problem down to a few different areas:

1. It seems that if I read the file into a large scalar by setting
$/ = undef, the file gets read faster. So I assume the slowdown is
happening in the splitting of the lines.

Seems so; on my system I get similar results. If you can narrow your
problem down to a few lines of code, feel free to post that small
program, since that makes it easier to reproduce the problem. Just for
testing, I've written such a small script for you.


#!/usr/bin/perl -w
use strict;
use warnings;
use Benchmark;

my $file = '50mb.txt';
for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        # local $/ = undef;                        # enable for slurp mode
        open my $fh, '<', $file or die $!;
        # my @lines = <$fh>;                       # alternative: list assignment
        my @lines; push @lines, $_ while <$fh>;
    } ) ), "\n";
}
__END__


The file I'm reading here consists of 1.5 million lines (50 MB
altogether). I get:

4 wallclock secs ( 3.98 usr + 0.14 sys = 4.13 CPU) @ 0.24/s (n=1)
34 wallclock secs (33.89 usr + 0.16 sys = 34.05 CPU) @ 0.03/s (n=1)
27 wallclock secs (26.17 usr + 0.13 sys = 26.30 CPU) @ 0.04/s (n=1)
28 wallclock secs (27.77 usr + 0.20 sys = 27.97 CPU) @ 0.04/s (n=1)

With the local $/ = undef line enabled (slurp mode), I get:

1 wallclock secs ( 0.77 usr + 0.09 sys = 0.86 CPU) @ 1.16/s (n=1)
0 wallclock secs ( 0.72 usr + 0.17 sys = 0.89 CPU) @ 1.12/s (n=1)
0 wallclock secs ( 0.72 usr + 0.23 sys = 0.95 CPU) @ 1.05/s (n=1)
1 wallclock secs ( 0.70 usr + 0.23 sys = 0.94 CPU) @ 1.07/s (n=1)

With "my @lines = <$fh>" instead of the while loop, I get:

22 wallclock secs (16.13 usr + 5.22 sys = 21.34 CPU) @ 0.05/s (n=1)
36 wallclock secs (35.38 usr + 0.22 sys = 35.59 CPU) @ 0.03/s (n=1)
6 wallclock secs ( 5.58 usr + 0.14 sys = 5.72 CPU) @ 0.17/s (n=1)
37 wallclock secs (36.88 usr + 0.17 sys = 37.05 CPU) @ 0.03/s (n=1)

Curious, I don't know why the third attempt is breaking ranks.

I have also run my script with another input file, one with
considerably fewer newlines (also 50 MB, approx. 200,000 lines). I get
the following results for the loop:

1 wallclock secs ( 1.34 usr + 0.14 sys = 1.48 CPU) @ 0.67/s (n=1)
12 wallclock secs (11.45 usr + 0.19 sys = 11.64 CPU) @ 0.09/s (n=1)
15 wallclock secs (14.48 usr + 0.19 sys = 14.67 CPU) @ 0.07/s (n=1)
10 wallclock secs (10.45 usr + 0.22 sys = 10.67 CPU) @ 0.09/s (n=1)

And for the version with "my @lines = <$fh>":

3 wallclock secs ( 3.06 usr + 0.33 sys = 3.39 CPU) @ 0.29/s (n=1)
57 wallclock secs (55.86 usr + 0.31 sys = 56.17 CPU) @ 0.02/s (n=1)
60 wallclock secs (59.20 usr + 0.23 sys = 59.44 CPU) @ 0.02/s (n=1)
58 wallclock secs (57.39 usr + 0.22 sys = 57.61 CPU) @ 0.02/s (n=1)

It seems that Perl needs more time the longer the lines are. Assuming
this, I ran the script with a 50 MB file containing only one newline in
the middle; here all attempts need (nearly) the same time:

269 wallclock secs (185.00 usr + 81.86 sys = 266.86 CPU) @ 0.00/s (n=1)
277 wallclock secs (184.42 usr + 87.11 sys = 271.53 CPU) @ 0.00/s (n=1)
276 wallclock secs (183.98 usr + 86.03 sys = 270.02 CPU) @ 0.00/s (n=1)
272 wallclock secs (184.74 usr + 85.03 sys = 269.77 CPU) @ 0.00/s (n=1)
2. If I append to one large array, rather than rewriting a different
array each time, the slowdown does not occur. So it seems Perl has a
hard time with memory it already holds, but is fine with memory it
just took from the system?

Right. In my example: if I move the declaration "my @lines" in front
of the for-loop (sketched below the timings), I get for the first file
with 1.5 million lines (only the for-loop matters):

4 wallclock secs ( 3.02 usr + 0.25 sys = 3.27 CPU) @ 0.31/s (n=1)
3 wallclock secs ( 2.95 usr + 0.31 sys = 3.27 CPU) @ 0.31/s (n=1)
7 wallclock secs ( 2.86 usr + 0.27 sys = 3.13 CPU) @ 0.32/s (n=1)
9 wallclock secs ( 3.11 usr + 0.34 sys = 3.45 CPU) @ 0.29/s (n=1)

Actually this creates an array with 6 million elements. The
performance penalty in the second half is just because my machine has
only 512 MB RAM and needs to swap. Hence the results for the file with
only 200,000 lines look much better (no swapping needed):

1 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
2 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
1 wallclock secs ( 1.03 usr + 0.28 sys = 1.31 CPU) @ 0.76/s (n=1)
1 wallclock secs ( 1.09 usr + 0.23 sys = 1.33 CPU) @ 0.75/s (n=1)
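For clarity, the changed part of the benchmark looks roughly like this:

my @lines;                                     # declared once, before the for-loop
for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        open my $fh, '<', $file or die $!;
        push @lines, $_ while <$fh>;           # keeps appending to the same array
    } ) ), "\n";
}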
3. The problem does not seem to happen on Linux, but I'm working on
Windows.

I have run this on Windows XP SP2 with ActiveState's Perl 5.8.6.

Any suggestions for a workaround? Has anyone else seen this? Thanks in
advance.

I have no suggestions for a workaround ;-(

Yes, I have seen it now ;-)

But: is it really necessary to read in the whole file? Do you compare
the first line with the last one in the worst case? Perhaps you could
give your algorithm a second thought.
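If your edits are mostly per-line, one thing that might be worth a try
is Tie::File, which presents the file as an array without holding every
line in memory at once (just a sketch; whether it is fast enough for
your 50 MB files is another question):

use strict;
use warnings;
use Tie::File;

# Sketch: the file appears as @lines; records are read from disk on demand.
tie my @lines, 'Tie::File', '50mb.txt' or die "tie failed: $!";
splice @lines, 9, 1;               # e.g. delete the tenth line
@lines[0, 1] = @lines[1, 0];       # e.g. swap the first two lines
untie @lines;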

regards,
fabian
 
Fabian Pilkowski

* Fabian Pilkowski wrote:
* (e-mail address removed) wrote:

[reading a large file into an array if memory is already allocated]
The file I'm reading here consists of 1.5 million lines (50 MB
altogether). I get:

4 wallclock secs ( 3.98 usr + 0.14 sys = 4.13 CPU) @ 0.24/s (n=1)
34 wallclock secs (33.89 usr + 0.16 sys = 34.05 CPU) @ 0.03/s (n=1)
27 wallclock secs (26.17 usr + 0.13 sys = 26.30 CPU) @ 0.04/s (n=1)
28 wallclock secs (27.77 usr + 0.20 sys = 27.97 CPU) @ 0.04/s (n=1)

Upgrading to ActiveState's current version 5.8.7 does not solve this
problem. Without changing anything but the Perl version, I get:

6 wallclock secs ( 4.50 usr + 0.22 sys = 4.72 CPU) @ 0.21/s (n=1)
69 wallclock secs (68.27 usr + 0.16 sys = 68.42 CPU) @ 0.01/s (n=1)
68 wallclock secs (67.30 usr + 0.31 sys = 67.61 CPU) @ 0.01/s (n=1)
68 wallclock secs (67.30 usr + 0.20 sys = 67.50 CPU) @ 0.01/s (n=1)
With "my @lines = <$fh>" instead of the while loop, I get:

22 wallclock secs (16.13 usr + 5.22 sys = 21.34 CPU) @ 0.05/s (n=1)
36 wallclock secs (35.38 usr + 0.22 sys = 35.59 CPU) @ 0.03/s (n=1)
6 wallclock secs ( 5.58 usr + 0.14 sys = 5.72 CPU) @ 0.17/s (n=1)
37 wallclock secs (36.88 usr + 0.17 sys = 37.05 CPU) @ 0.03/s (n=1)

And this turns into:

21 wallclock secs (17.45 usr + 4.19 sys = 21.64 CPU) @ 0.05/s (n=1)
255 wallclock secs (252.19 usr + 0.33 sys = 252.51 CPU) @ 0.00/s (n=1)
264 wallclock secs (255.63 usr + 0.38 sys = 256.00 CPU) @ 0.00/s (n=1)
261 wallclock secs (254.33 usr + 0.52 sys = 254.84 CPU) @ 0.00/s (n=1)

It seems that someone wants to prevent you from reading large files
into an array. But perhaps this slowdown affects other Perl stuff too.
Until now I thought things would run faster when memory is already
allocated; it seems Perl isn't simply re-using it, but is doing
something else with it first.

I have run this on Windows XP SP2 with ActiveState's Perl 5.8.6.

As mentioned, I upgraded to ActiveState's Perl 5.8.7 just now. Does
anyone know (or have any idea) what Perl is doing when re-using
already allocated memory on Windows systems?

Or is Windows itself the cause of this behavior? Could anyone reproduce
this problem with another Perl distribution under Windows?

regards,
fabian
 
