Slurp large files into an array, first is quick, rest are slow



gdtrob

I am slurping a series of large .csv files (6MB) directly into an array
one at a time (then querying). The first time I slurp a file it is
incredibly quick. The second time I do it the slurping is very slow
despite the fact that I close the file (using a filehandle) and undef
the array. Here is the relevant code:

open (TARGETFILE,"CanRPT"."$chromosome".".csv") || die "can't open targetfile: $!";
print "opened";
@chrfile = <TARGETFILE>; # slurp the chromosome-specific repeat file into memory
print "slurped";

(and after each loop)

close (TARGETFILE);
undef @chrfile;

If it is possible to quickly/simply fix this I would much rather keep
this method than setting up a line by line input to the array. The
first slurp is very efficient.

I am using ActiveState Perl 5.6 on a Win32 system with 1 GB of RAM.
 

Kevin Collins

I am slurping a series of large .csv files (6MB) directly into an array
one at a time (then querying). The first time I slurp a file it is
incredibly quick. The second time I do it the slurping is very slow
despite the fact that I close the file (using a filehandle) and undef
the array. here is the relevant code:

open (TARGETFILE,"CanRPT"."$chromosome".".csv") || die "can't open
                          ^^^^^^^^^^^^^

No need to quote this. It should be either:
open (TARGETFILE,"CanRPT".$chromosome.".csv") || die "can't open
or
open (TARGETFILE,"CanRPT$chromosome.csv") || die "can't open
targetfile: $!";
print "opened";
@chrfile = <TARGETFILE>; # slurp the chromosome-specific repeat file into memory
print "slurped";

(and after each loop)

close (TARGETFILE);

Not that it answers your question, but you should be able to close your file
immediately after slurping it in, rather than after a loop...
undef @chrfile;

If it is possible to quickly/simply fix this I would much rather keep
this method than setting up a line by line input to the array. The
first slurp is very efficient.

I am using activestate perl 5.6 on a win32 system with 1 gig ram:


Kevin
 

Mark Clements

I am slurping a series of large .csv files (6MB) directly into an array
one at a time (then querying). The first time I slurp a file it is
incredibly quick. The second time I do it the slurping is very slow
despite the fact that I close the file (using a filehandle) and undef
the array. here is the relevant code:

open (TARGETFILE,"CanRPT"."$chromosome".".csv") || die "can't open targetfile: $!";
print "opened";
@chrfile = <TARGETFILE>; # slurp the chromosome-specific repeat file into memory
print "slurped";

(and after each loop)

close (TARGETFILE);
undef @chrfile;

If it is possible to quickly/simply fix this I would much rather keep
this method than setting up a line by line input to the array. The
first slurp is very efficient.

I am using activestate perl 5.6 on a win32 system with 1 gig ram:

I'd argue you'd be better off processing one line at a time, but anyway...

You need more detailed timing data: you are assuming that the extra time
is being spent in the slurp, but you have no timing data to prove this.

Use something like

Benchmark::Timer

to provide a detailed breakdown of where the time is being spent. You
may be surprised. It would be an idea to display file size and number of
lines at the same time.
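A minimal sketch of that timing advice, using only the core Time::HiRes module (Benchmark::Timer from CPAN would give a richer per-tag breakdown). The file name and CSV layout here are invented for illustration; the real name would be "CanRPT$chromosome.csv":

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Demo data so the sketch runs stand-alone.
my $filename = 'demo.csv';
open my $out, '>', $filename or die "can't create $filename: $!";
print {$out} "chr1,$_,repeat\n" for 1 .. 1000;
close $out;

# Time just the slurp, and report size and line count alongside it.
my $t0 = [gettimeofday];
open my $fh, '<', $filename or die "can't open $filename: $!";
my @chrfile = <$fh>;
close $fh;
my $elapsed = tv_interval($t0);

printf "slurped %d lines (%d bytes) in %.4fs\n",
    scalar @chrfile, -s $filename, $elapsed;
```

Wrapping each suspect stage (slurp, query, cleanup) in its own timer like this shows where the time actually goes.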

Running with

use strict;
use warnings;

will save you a lot of heartache. Also, it is now recommended to use
lexically scoped filehandles:

open my $fh, "<", $filename
or die "could not open $filename for read: $!";

You may also want to check out one of the CSV parsing modules available,
e.g.

DBD::CSV
Text::CSV_XS
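A minimal sketch of what using one of those modules looks like, assuming Text::CSV_XS is installed from CPAN. The file name and the three-column layout are invented for illustration:

```perl
use strict;
use warnings;
use Text::CSV_XS;   # CPAN module; Text::CSV is a pure-Perl alternative

# Demo data so the sketch runs stand-alone; note the quoted field
# containing commas, which a naive split(/,/) would mangle.
my $filename = 'demo.csv';
open my $out, '>', $filename or die "can't create $filename: $!";
print {$out} qq{chr1,$_,"repeat,with,commas"\n} for 1 .. 3;
close $out;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $filename or die "can't open $filename: $!";
my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;   # each $row is an arrayref of parsed fields
}
close $fh;

print scalar @rows, " rows, ", scalar @{ $rows[0] }, " fields each\n";
```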

Mark
 

A. Sinan Unur

(e-mail address removed) wrote in
I am slurping a series of large .csv files (6MB) directly into an
array one at a time (then querying). The first time I slurp a file it
is incredibly quick. The second time I do it the slurping is very slow
despite the fact that I close the file (using a filehandle) and undef
the array. here is the relevant code:

open (TARGETFILE,"CanRPT"."$chromosome".".csv") || die "can't open targetfile: $!";
print "opened";
@chrfile = <TARGETFILE>; # slurp the chromosome-specific repeat file into memory
print "slurped";

(and after each loop)

close (TARGETFILE);
undef @chrfile;

Here is what the loop body would look like if I were writing this:

{
my $name = sprintf 'CanRPT%s.csv', $chromosome;
open my $target, '<', $name
or die "Cannot open '$name': $!";
my @chrfile = <$target>;

# do something with @chrfile
}
If it is possible to quickly/simply fix this I would much rather keep
this method than setting up a line by line input to the array. The
first slurp is very efficient.

I am using activestate perl 5.6 on a win32 system with 1 gig ram:

I am assuming the problem has to do with your coding style. You don't
seem to be using lexicals effectively, and the fact that you are
repeatedly slurping is a red flag.

Can't you read each file once (slurped or line-by-line), build the
data structure it represents, and then use that data structure for
further processing?
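A rough sketch of that read-once idea: cache each chromosome's lines in a hash the first time they are requested, so repeated queries against the same chromosome never re-slurp the file. The CanRPT naming follows the original post; the demo data is invented:

```perl
use strict;
use warnings;

my %repeats;    # chromosome => arrayref of lines, filled lazily

sub repeats_for {
    my ($chromosome) = @_;
    return $repeats{$chromosome} ||= do {
        my $name = "CanRPT$chromosome.csv";
        open my $fh, '<', $name or die "Cannot open '$name': $!";
        [ <$fh> ];   # $fh closes when it goes out of scope
    };
}

# Demo: the second call returns the cached arrayref without touching disk.
open my $out, '>', 'CanRPT1.csv' or die "can't create demo file: $!";
print {$out} "chr1,$_,repeat\n" for 1 .. 5;
close $out;

my $first  = repeats_for(1);
my $second = repeats_for(1);
print "cached\n" if $first == $second;   # same reference both times
```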

It is impossible to tell without having seen the program, but the
constant slurping might be causing memory fragmentation and therefore
excessive pagefile hits. Dunno, really.

Sinan
--
 

Smegal

Thanks everyone,

I thought this might be a simple slurp usage problem, i.e. freeing up
memory or something, because the program runs; it's just really slow
after the first slurp. But I wasn't able to find anything by searching
Google. I'll look into improving my coding as suggested and see if
the problem persists.

Grant
 

Eric J. Roode

my $name = sprintf 'CanRPT%s.csv', $chromosome;

OOC, why use sprintf here instead of

my $name = "CanRPT$chromosome.csv";

?

--
Eric
`$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
$!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
$_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`
 

Big and Blue

undef @chrfile;

Why bother? You are about to replace this with the read of the next
file. Undeffing it means you throw away all of the memory allocation you
have, just for Perl to reassign it all, and this may lead to heap memory
fragmentation.
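A sketch of that suggestion: drop the undef and simply reassign into the same array, letting perl recycle the existing allocation between files. The demo files and the two-chromosome list are invented for illustration:

```perl
use strict;
use warnings;

# Demo files so the sketch runs stand-alone; the real names would be
# CanRPT1.csv, CanRPT2.csv, ... per chromosome.
for my $c (1, 2) {
    open my $out, '>', "CanRPT$c.csv" or die "can't create demo file: $!";
    print {$out} "chr$c,$_,repeat\n" for 1 .. 10 * $c;
    close $out;
}

# Plain reassignment overwrites @chrfile in place each pass, so the
# memory is reused rather than freed and regrown as undef would force.
my @chrfile;
for my $chromosome (1, 2) {
    my $name = "CanRPT$chromosome.csv";
    open my $fh, '<', $name or die "can't open $name: $!";
    @chrfile = <$fh>;
    close $fh;
    print "$name: ", scalar @chrfile, " lines\n";
    # ... query @chrfile here ...
}
```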
 
