C programmer in need of Perl advice


Mike Deskevich

I have a quick (hopefully) question for the Perl gurus out there. I
have a bunch of data files that I need to read in and do some
processing on. The data files are simply two columns of (floating point)
numbers, but a file can range from 1,000 to 10,000 lines.
I need to save the data in arrays for post-processing, so I can't
just read a line and throw the data away. My main question is: is
there a faster way to read the data than how I'm currently doing it?
(I'm a C programmer, so I'm sure I'm not using Perl as
efficiently as I could.)

Here's how I read my data files:

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue

Is there a more efficient way to read in two columns of numbers? It
turns out that I have a whole series of these data files to process, and I
think most of the time is going either to Perl start-up or to reading
the data; the post-processing is happening pretty fast (I think).

thanks,
mike
 

Michael P. Broida

Mike said:
Here's how I read my data files:

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue

Is there a more efficient way to read in two columns of numbers?

I would suggest NOT using index $ct. You can use
"push" to add elements to an array. AFTER all the
elements have been added, "scalar(@arrayname)" will
give you the number of entries in the array.
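
Something like this, roughly (untested):

while (<DATAFILE>) {
    my ($x, $y) = split;
    push @xvalue, $x;
    push @yvalue, $y;
}
my $count = scalar(@xvalue);   # number of entries read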

Dunno if that will help much, but it couldn't hurt.

Mike
 

chance

Mike Deskevich said:
Here's how I read my data files:
$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue
Is there a more efficient way to read in two columns of numbers?

I'm no Perl guru. I'm really a C programmer who doesn't suck too badly
at Perl. Above is pretty much how I'd do it.

If speed is really a problem, I've got two suggestions:

1) If you don't have @xvalue and @yvalue already allocated, you'll be doing a
lot of dynamic memory allocation. That could be costing you.

If you knew in advance the length you could do:

$xvalue[$vector_len - 1] = 0.0

If the lengths of your vectors really are unknowable, then you could
at least start pre-allocating chunks in advance and doubling the
size each time you 'run out of room', although then you're going to
have to do some pain-in-the-ass bookkeeping. If you have a pretty
good upper bound, the right thing to do might be to go ahead and
say $xvalue[10000] = 0.0, then at the end set $#x to $ct-1 (modulo
my fencepost errors); rough sketch below.

2) Some kind of sscanf-like function probably exists somewhere.
It might be more specialized, and hence faster, than using split.
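
For item 1, a rough, untested sketch of the "guess a generous upper bound,
then trim" idea (10,000 is just a guessed cap):

$#xvalue = 9_999;            # pre-extend both arrays to 10,000 slots
$#yvalue = 9_999;
my $ct = 0;
while (<DATAFILE>) {
    ($xvalue[$ct], $yvalue[$ct]) = split;
    $ct++;
}
$#xvalue = $ct - 1;          # trim off the unused slots
$#yvalue = $ct - 1;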

A real guru may have a much better answer.
 

A. Sinan Unur

(e-mail address removed) (Mike Deskevich) wrote:
here's how i read my data files

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}

In this case, the @xvalue and @yvalue arrays are constantly being
resized. Eliminating that may increase performance, but you might want to
actually measure it. I doubt there is going to be a huge difference
with only 10,000 records.

#! C:/Perl/bin/perl.exe

use strict;
use warnings;

my $fn = shift || 'data';

my $curr_max = 1000;

my @xvalue = ();
$#xvalue = $curr_max;
my @yvalue = ();
$#yvalue = $curr_max;

open(DATAFILE, "<", $fn) || die "Cannot open $fn: $!\n";
my $i = 0;
while (<DATAFILE>) {
    ($xvalue[$i], $yvalue[$i]) = split;
    ++$i;
    if ($i >= $curr_max) {       # ran out of pre-allocated room: double it
        $curr_max *= 2;
        $#xvalue = $curr_max;
        $#yvalue = $curr_max;
    }
}

# $i lines were read, so the highest valid index is $i - 1;
# trim the unused pre-allocated slots.
$#xvalue = $i - 1;
$#yvalue = $i - 1;

close(DATAFILE) || die "Cannot close input file: $!\n";

__END__
 

Steve Grazzini

A. Sinan Unur said:
$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}

In this case, the @xvalue and @yvalue arrays are constantly being
resized.

I think "constantly" is too strong -- when an array needs to be
extended, perl will allocate (roughly) enough space for its size
to double.

Calling push() 10,000 times will only resize the array twelve
times, and, unlike "$#x = $BIGNUM", the end of the array will never
be filled with undefined elements.
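
If anyone wants to put actual numbers on it, the core Benchmark module
will do it; a rough, untested sketch (the fake data and label names are
just for illustration):

use Benchmark qw(cmpthese);

my @lines = map { "$_.5 $_.25\n" } 1 .. 10_000;   # fake two-column data

cmpthese(-2, {
    indexed => sub {
        my (@x, @y);
        my $ct = 0;
        for (@lines) { ($x[$ct], $y[$ct]) = split; $ct++ }
    },
    pushed  => sub {
        my (@x, @y);
        for (@lines) { my ($a, $b) = split; push @x, $a; push @y, $b }
    },
});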
 

Steve Grazzini

[ TOFU was lost under the signature ]
Greg Patnude said:
It's called "slurping" a file -- the so-called "pros" claim it is not really
"recommended", but I do it all the time with absolutely no consequences and no
obvious performance hit until the files exceed 6 MB or so ...

if (open (DATA, "$FILENAME")) {
@DATA = <DATA>;
}

Presumably this scaling problem is what they had in mind.

And anyway, how does this help the OP, who needs the first bit of each
line to go in one array and the second to go in another?

Also, you might want to reconsider using the special DATA filehandle
like this.
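
Even if you do slurp the lines, you still end up looping over them to fill
the two arrays; roughly (untested):

my @lines = <DATAFILE>;      # slurp all lines at once
my (@xvalue, @yvalue);
for (@lines) {
    my ($x, $y) = split;
    push @xvalue, $x;
    push @yvalue, $y;
}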
 

Tad McClellan

My main question is: is there a faster way to read the data than how I'm currently doing it?
Here's how I read my data files:
$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
I *think* that most of the time is being used in either Perl start-up
time or data reading time; the post-processing is happening pretty
fast (I think).


Profile it and then you'll *know* where the slow part is.
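
(With a stock Perl of this vintage, the usual tool is Devel::DProf --
"yourscript.pl" and "data.txt" below are just placeholder names:)

perl -d:DProf yourscript.pl data.txt
dprofpp tmon.out     # report on where the time went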

If you knew in advance the length you could do :

$xvalue[$vector_len - 1] = 0.0


But what if 0.0 is a legal value?

How will you know that this is the bogus one?

$#xvalue = $vector_len - 1; # extend @xvalue array with undef values

if the lengths of your vectors really are unknowable, then you could
at least start pre-allocating chunks in advance and doubling the
size each time you 'run out of room',


Which is roughly what perl is doing for you already...

A real guru may have a much better answer


It seems unlikely to me that it is I/O bound.

The only way to know is to profile it; until then we're
spinning our wheels.
 

chance

If you knew in advance the length you could do :
$xvalue[$vector_len - 1] = 0.0

But what if 0.0 is a legal value?
How will you know that this is the bogus one?

Well ... I was planning on writing over the 0.0 with valid data.

All irrelevant if Perl does the allocation in a manner which
guarantees log(N) allocs, though. So there went that guess.
The only way to know is to profile it, until then we're
spinning our wheels.

Agreed. I was just shooting from the hip.

Mr. Original Poster: if you do actually profile, I would
be interested to know where the first bottleneck is.

My money is now on finding an sscanf-ish replacement for the
split statement. But that's just because I haven't thought of
anything else.
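
For instance, a tight regex instead of split might (or might not) win --
completely untested, and it would need benchmarking:

while (<DATAFILE>) {
    next unless /^\s*(\S+)\s+(\S+)/;   # grab just the first two fields
    push @xvalue, $1;
    push @yvalue, $2;
}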
 

Joe Minicozzi

Greg Patnude said:
It's called "slurping" a file -- the so-called "pros" claim it is not really
"recommended", but I do it all the time with absolutely no consequences and no
obvious performance hit until the files exceed 6 MB or so ...

if (open (DATA, "$FILENAME")) {

@DATA = <DATA>;

}

you can read more about it in the Perl FAQ -->

http://www.perldoc.com/perl5.6/pod/perlfaq5.html#How-can-I-read-in-an-entire-file-all-at-once-

--
Greg Patnude / The Digital Demention
2916 East Upper Hayden Lake Road
Hayden Lake, ID 83835
(208) 762-0762


Not intending to be picky, but "slurping" a file usually means reading the
entire file into a single scalar variable as a string. This is done by
setting $/ to undef before the read.
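
For example (untested sketch; $filename stands in for whatever file is being read):

my $contents;
{
    local $/;        # undef $/ so the next read grabs the whole file as one string
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    $contents = <$fh>;
    close $fh;
}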
 

Tore Aursand

$#ARRAY will give you the number of array elements also ...

No. $#ARRAY will give you the highest index in @ARRAY, while @ARRAY in
scalar context gives you the number of elements.
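
For instance:

my @array = (10, 20, 30);
print scalar(@array), "\n";   # 3 -- number of elements
print $#array, "\n";          # 2 -- highest index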
 

Tore Aursand

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue

I created a file with 100,000 lines of tab-delimited floating point
numbers, and it took my computer 1.5 seconds to add the two columns to two
different arrays. How fast do you want it to be?
 

Mike Deskevich


Mr. Original Poster: if you do actually profile, I would
be interested to know where the first bottleneck is.

My money is now on finding an sscanf-ish replacement for the
split statement. But that's just because I haven't thought of
anything else.


Yes, I agree that profiling is the best way to find the bottleneck. I'm
new to Perl and don't know all the internals yet. Are there built-in
functions to help with profiling?

thanks!
mike
 
