C programmer in need of Perl advice


Mike Deskevich

I have a quick (hopefully) question for the Perl gurus out there. I
have a bunch of data files that I need to read in and do some
processing on. The data files are simply two columns of (floating point)
numbers, but a file can range from 1,000 to 10,000 lines.
I need to save the data in arrays for post-processing, so I can't
just read a line and throw the data away. My main question is: is
there a faster way to read the data than how I'm currently doing it?
(I'm a C programmer, so I'm sure I'm not using Perl as
efficiently as I could.)

Here's how I read my data files:

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue

Is there a more efficient way to read in two columns of numbers? It
turns out that I have a whole series of these data files to process, and I
think most of the time is going either to Perl start-up or to reading
the data; the post-processing is happening pretty fast (I think).

thanks,
mike
 

Michael P. Broida

Mike said:
Here's how I read my data files:

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue

Is there a more efficient way to read in two columns of numbers?

I would suggest NOT using index $ct. You can use
"push" to add elements to an array. AFTER all the
elements have been added, "scalar(@arrayname)" will
give you the number of entries in the array.
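
Something like this, roughly (untested):

while (<DATAFILE>) {
    my ($x, $y) = split;
    push @xvalue, $x;
    push @yvalue, $y;
}
my $count = scalar(@xvalue);   # number of entries read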

Dunno if that will help much, but it couldn't hurt.

Mike
 

chance

Mike Deskevich said:
Here's how I read my data files:
$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue
Is there a more efficient way to read in two columns of numbers?

I'm no Perl guru. I'm really a C programmer who doesn't suck too badly
at Perl. Above is pretty much how I'd do it.

If speed is really a problem, I've got two suggestions:

1) If you don't have @xvalue and @yvalue already allocated, you'll be doing a
lot of dynamic memory allocation. That could be costing you.

If you knew in advance the length you could do:

$xvalue[$vector_len - 1] = 0.0

If the lengths of your vectors really are unknowable, then you could
at least start pre-allocating chunks in advance and doubling the
size each time you 'run out of room', although then you're going to
have to do some pain-in-the-ass bookkeeping. If you have a pretty
good upper bound, the right thing to do might be to go ahead and
say $xvalue[10000] = 0.0, then at the end set $#x to $ct-1 (modulo
my fencepost errors); rough sketch below.

2) Some kind of sscanf-like function probably exists somewhere.
It might be more specialized, and hence faster, than using split.
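
For item 1, a rough, untested sketch of the "guess a generous upper bound,
then trim" idea (10,000 is just a guessed cap):

$#xvalue = 9_999;            # pre-extend both arrays to 10,000 slots
$#yvalue = 9_999;
my $ct = 0;
while (<DATAFILE>) {
    ($xvalue[$ct], $yvalue[$ct]) = split;
    $ct++;
}
$#xvalue = $ct - 1;          # trim off the unused slots
$#yvalue = $ct - 1;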

A real guru may have a much better answer.
 

A. Sinan Unur

(e-mail address removed) (Mike Deskevich) wrote:
here's how i read my data files

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}

In this case, the @xvalue and @yvalue arrays are constantly being
resized. Eliminating that may increase performance, but you might want to
actually measure it. I doubt there is going to be a huge difference
with only 10,000 records.

#! C:/Perl/bin/perl.exe

use strict;
use warnings;

my $fn = shift || 'data';

my $curr_max = 1000;

my @xvalue = ();
$#xvalue = $curr_max;
my @yvalue = ();
$#yvalue = $curr_max;

open(DATAFILE, "<", $fn) || die "Cannot open $fn: $!\n";
my $i = 0;
while (<DATAFILE>) {
    ($xvalue[$i], $yvalue[$i]) = split;
    ++$i;
    if ($i >= $curr_max) {       # ran out of pre-allocated room: double it
        $curr_max *= 2;
        $#xvalue = $curr_max;
        $#yvalue = $curr_max;
    }
}

# $i lines were read, so the highest valid index is $i - 1;
# trim the unused pre-allocated slots.
$#xvalue = $i - 1;
$#yvalue = $i - 1;

close(DATAFILE) || die "Cannot close input file: $!\n";

__END__
 

Steve Grazzini

A. Sinan Unur said:
$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}

In this case, the @xvalue and @yvalue arrays are constantly being
resized.

I think "constantly" is too strong -- when an array needs to be
extended, perl will allocate (roughly) enough space for its size
to double.

Calling push() 10,000 times will only resize the array twelve
times, and, unlike "$#x = $BIGNUM", the end of the array will never
be filled with undefined elements.
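
If anyone wants to put actual numbers on it, the core Benchmark module
will do it; a rough, untested sketch (the fake data and label names are
just for illustration):

use Benchmark qw(cmpthese);

my @lines = map { "$_.5 $_.25\n" } 1 .. 10_000;   # fake two-column data

cmpthese(-2, {
    indexed => sub {
        my (@x, @y);
        my $ct = 0;
        for (@lines) { ($x[$ct], $y[$ct]) = split; $ct++ }
    },
    pushed  => sub {
        my (@x, @y);
        for (@lines) { my ($a, $b) = split; push @x, $a; push @y, $b }
    },
});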
 

Steve Grazzini

[ TOFU was lost under the signature ]
Greg Patnude said:
It's called "slurping" a file -- the so-called "pros" claim it is not really
"recommended", but I do it all the time with absolutely no consequences and no
obvious performance hit until the files exceed 6 MB or so ...

if (open (DATA, "$FILENAME")) {
@DATA = <DATA>;
}

Presumably this scaling problem is what they had in mind.

And anyway, how does this help the OP, who needs the first bit of each
line to go in one array and the second to go in another?

Also, you might want to reconsider using the special DATA filehandle
like this.
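
Even if you do slurp the lines, you still end up looping over them to fill
the two arrays; roughly (untested):

my @lines = <DATAFILE>;      # slurp all lines at once
my (@xvalue, @yvalue);
for (@lines) {
    my ($x, $y) = split;
    push @xvalue, $x;
    push @yvalue, $y;
}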
 

Tad McClellan

My main question is: is there a faster way to read the data than how I'm currently doing it?
Here's how I read my data files:
$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
I *think* that most of the time is being used in either Perl start-up
time or data reading time; the post-processing is happening pretty
fast (I think).


Profile it and then you'll *know* where the slow part is.
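
(With a stock Perl of this vintage, the usual tool is Devel::DProf --
"yourscript.pl" and "data.txt" below are just placeholder names:)

perl -d:DProf yourscript.pl data.txt
dprofpp tmon.out     # report on where the time went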

If you knew in advance the length you could do :

$xvalue[$vector_len - 1] = 0.0


But what if 0.0 is a legal value?

How will you know that this is the bogus one?

$#xvalue = $vector_len - 1; # extend @xvalue array with undef values

if the lengths of your vectors really are unknowable, then you could
at least start pre-allocating chunks in advance and doubling the
size each time you 'run out of room',


Which is roughly what perl is doing for you already...

A real guru may have a much better answer


It seems unlikely to me that it is I/O bound.

The only way to know is to profile it; until then we're
spinning our wheels.
 

chance

If you knew in advance the length you could do :
$xvalue[$vector_len - 1] = 0.0

But what if 0.0 is a legal value?
How will you know that this is the bogus one?

Well ... I was planning on writing over the 0.0 with valid data.

All irrelevant if Perl does the allocation in a manner which
guarantees log(N) allocs, though. So there went that guess.
The only way to know is to profile it, until then we're
spinning our wheels.

Agreed. I was just shooting from the hip.

Mr. Original Poster: if you do actually profile, I would
be interested to know where the first bottleneck is.

My money is now on finding an sscanf-ish replacement for the
split statement. But that's just because I haven't thought of
anything else.
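
For instance, a tight regex instead of split might (or might not) win --
completely untested, and it would need benchmarking:

while (<DATAFILE>) {
    next unless /^\s*(\S+)\s+(\S+)/;   # grab just the first two fields
    push @xvalue, $1;
    push @yvalue, $2;
}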
 

Joe Minicozzi

Greg Patnude said:
It's called "slurping" a file -- the so-called "pros" claim it is not really
"recommended", but I do it all the time with absolutely no consequences and no
obvious performance hit until the files exceed 6 MB or so ...

if (open (DATA, "$FILENAME")) {

@DATA = <DATA>;

}

you can read more about it in the Perl FAQ -->

http://www.perldoc.com/perl5.6/pod/perlfaq5.html#How-can-I-read-in-an-entire-file-all-at-once-

--
Greg Patnude / The Digital Demention
2916 East Upper Hayden Lake Road
Hayden Lake, ID 83835
(208) 762-0762


Not intending to be picky, but "slurping" a file usually means reading the
entire file into a single scalar variable as a string. This is done by
setting $/ to undef before the read.
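
For example (untested sketch; $filename stands in for whatever file is being read):

my $contents;
{
    local $/;        # undef $/ so the next read grabs the whole file as one string
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    $contents = <$fh>;
    close $fh;
}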
 

Tore Aursand

$#ARRAY will give you the number of array elements also ...

No. $#ARRAY will give you the highest index in @ARRAY, while @ARRAY in
scalar context gives you the number of elements.
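
For instance:

my @array = (10, 20, 30);
print scalar(@array), "\n";   # 3 -- number of elements
print $#array, "\n";          # 2 -- highest index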
 

Tore Aursand

$ct=0;
while (<DATAFILE>)
{
($xvalue[$ct],$yvalue[$ct])=split;
$ct++;
}
#do stuff with xvalue and yvalue

I created a file with 100,000 lines of tab-delimited floating point
numbers, and it took my computer 1.5 seconds to add the two columns to two
different arrays. How fast do you want it to be?
 

Mike Deskevich


Mr. Original Poster: if you do actually profile, I would
be interested to know where the first bottleneck is.

My money is now on finding an sscanf-ish replacement for the
split statement. But that's just because I haven't thought of
anything else.


Yes, I agree that profiling is the best way to find the bottleneck. I'm
new to Perl and don't know all the internals yet. Are there built-in
functions to help with profiling?

thanks!
mike
 
