C programmer in need of Perl advice

Discussion in 'Perl Misc' started by Mike Deskevich, Oct 22, 2003.

  1. I have a quick (hopefully) question for the Perl gurus out there. I
    have a bunch of data files that I need to read in and do some
    processing. The data files are simply two columns of (floating point)
    numbers, but the size of a file can range from 1,000 to 10,000 lines.
    I need to save the data in an array for post-processing, so I can't
    just read a line and throw the data away. My main question is: is
    there a faster way to read the data than how I'm currently doing it?
    (I'm a C programmer, so I'm sure that I'm not using Perl as
    efficiently as I could.)

    Here's how I read my data files:

    $ct = 0;
    while (<DATAFILE>) {
        ($xvalue[$ct], $yvalue[$ct]) = split;
        $ct++;
    }
    # do stuff with xvalue and yvalue

    Is there a more efficient way to read in two columns of numbers? It
    turns out that I have a series of these data files to process, and I
    think that most of the time is being used in either Perl start-up
    time or data-reading time; the post-processing is happening pretty
    fast (I think).

    Thanks,
    Mike
    Mike Deskevich, Oct 22, 2003
    #1

  2. Mike Deskevich wrote:
    >
    > i have a quick (hopefully) question for the perl gurus out there. i
    > have a bunch of data files that i need to read in and do some
    > processing. the data files are simple two columns of (floating point)
    > numbers, but the size of the file can range from 1000 to 10,000 lines.
    > i need to save the data in an array for post processing, so i can't
    > just read a line and throw the data away. my main question is: is
    > there a faster way to read the data than how i'm currently doing it
    > (i'm a c programmer, so i'm sure that i'm not using perl as
    > efficiently as i can)
    >
    > here's how i read my data files
    >
    > $ct=0;
    > while (<DATAFILE>)
    > {
    > ($xvalue[$ct],$yvalue[$ct])=split;
    > $ct++;
    > }
    > #do stuff with xvalue and yvalue
    >
    > is there a more efficient way to read in two columns of numbers? it
    > turns out that i have a series of these data files to process and i
    > think that most of the time is being used in either perl start up
    > time, or data reading time, the post processing is happening pretty
    > fast (i think)


    I would suggest NOT using index $ct. You can use
    "push" to add elements to an array. AFTER all the
    elements have been added, "scalar(@arrayname)" will
    give you the number of entries in the array.

    Dunno if that will help much, but it couldn't hurt.
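
    A minimal sketch of that push-based version (assuming the same
    whitespace-separated two-column input as the original loop):

    my (@xvalue, @yvalue);
    while (<DATAFILE>) {
        my ($x, $y) = split;          # first column, second column
        push @xvalue, $x;
        push @yvalue, $y;
    }
    my $count = scalar(@xvalue);      # number of points read
    # do stuff with @xvalue and @yvalue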

    Mike
    Michael P. Broida, Oct 22, 2003
    #2

  3. Mike Deskevich <> wrote:
    > i have a quick (hopefully) question for the perl gurus out there. i
    > have a bunch of data files that i need to read in and do some
    > processing. the data files are simple two columns of (floating point)
    > numbers, but the size of the file can range from 1000 to 10,000 lines.
    > i need to save the data in an array for post processing, so i can't
    > just read a line and throw the data away. my main question is: is
    > there a faster way to read the data than how i'm currently doing it
    > (i'm a c programmer, so i'm sure that i'm not using perl as
    > efficiently as i can)


    > here's how i read my data files


    > $ct=0;
    > while (<DATAFILE>)
    > {
    > ($xvalue[$ct],$yvalue[$ct])=split;
    > $ct++;
    > }
    > #do stuff with xvalue and yvalue


    > is there a more efficient way to read in two columns of numbers? it
    > turns out that i have a series of these data files to process and i
    > think that most of the time is being used in either perl start up
    > time, or data reading time, the post processing is happening pretty
    > fast (i think)


    I'm no Perl guru. I'm really a C programmer who doesn't suck too badly
    at Perl. The above is pretty much how I'd do it.

    If speed is really a problem, I've got two suggestions:

    1) If you don't have @xvalue and @yvalue already allocated, you'll be doing a
    lot of dynamic memory allocation. That could be costing you.

    If you knew the length in advance you could do:

    $xvalue[$vector_len - 1] = 0.0;

    If the lengths of your vectors really are unknowable, then you could
    at least start pre-allocating chunks in advance and doubling the
    size each time you 'run out of room', although then you're going to
    have to do some pain-in-the-ass bookkeeping. If you have a pretty
    good upper bound, the right thing to do might be to go ahead and
    say $xvalue[10000] = 0.0, then at the end set $#x to $ct-1 (modulo
    my fencepost errors).

    2) Some kind of sscanf-like function probably exists somewhere. It
    might be more specialized, and hence faster, than using split.
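
    Since Perl has no built-in sscanf, the usual stand-in for suggestion 2
    is a capturing regex; whether it actually beats split on this data is
    something only profiling will tell. A rough sketch:

    while (<DATAFILE>) {
        # grab the first two whitespace-separated fields on the line
        if (/^\s*(\S+)\s+(\S+)/) {
            push @xvalue, $1;
            push @yvalue, $2;
        }
    }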

    A real guru may have a much better answer.
    --
    I used to think government was a necessary evil.
    I'm not so sure about the necessary part anymore.
    , Oct 22, 2003
    #3
  4. (Mike Deskevich) wrote in
    news::

    > here's how i read my data files
    >
    > $ct=0;
    > while (<DATAFILE>)
    > {
    > ($xvalue[$ct],$yvalue[$ct])=split;
    > $ct++;
    > }


    In this case, the @xvalue and @yvalue arrays are constantly being
    resized. Eliminating that may increase performance, but you might want to
    actually measure it. I doubt there is going to be a huge difference
    with only 10,000 records.

    #! C:/Perl/bin/perl.exe

    use strict;
    use warnings;

    my $fn = shift || 'data';

    my $curr_max = 1000;

    my @xvalue;
    $#xvalue = $curr_max;    # pre-extend both arrays
    my @yvalue;
    $#yvalue = $curr_max;

    open(DATAFILE, "<", $fn) || die "Cannot open $fn: $!\n";

    my $i = 0;
    while (<DATAFILE>) {
        ($xvalue[$i], $yvalue[$i]) = split;
        ++$i;
        if ($i >= $curr_max) {       # out of room: double the pre-allocation
            $curr_max *= 2;
            $#xvalue = $curr_max;
            $#yvalue = $curr_max;
        }
    }

    # trim the unused tail; the highest valid index is $i - 1
    $#xvalue = $i - 1;
    $#yvalue = $i - 1;

    close(DATAFILE) || die "Cannot close input file: $!\n";

    __END__

    --
    A. Sinan Unur

    A. Sinan Unur, Oct 22, 2003
    #4
  5. A. Sinan Unur <> wrote:
    > (Mike Deskevich) wrote:
    > > $ct=0;
    > > while (<DATAFILE>)
    > > {
    > > ($xvalue[$ct],$yvalue[$ct])=split;
    > > $ct++;
    > > }

    >
    > In this case, the $xvalue and $yvalue arrays are constantly being
    > resized.


    I think "constantly" is too strong -- when an array needs to be
    extended, perl will allocate (roughly) enough space for its size
    to double.

    Calling push() 10,000 times will only resize the array twelve
    times, and, unlike "$#x = $BIGNUM", the end of the array will never
    be filled with undefined elements.
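
    A small illustration of the difference between the two approaches:

    my @a;
    push @a, 0.0 for 1 .. 10_000;   # storage grows geometrically behind the scenes
    print scalar(@a), "\n";         # 10000, and no undef padding at the end

    my @b;
    $#b = 9_999;                    # 10_000 slots, all undef until assigned
    print defined $b[-1] ? "defined\n" : "undef\n";   # prints "undef"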

    --
    Steve
    Steve Grazzini, Oct 22, 2003
    #5
  6. Greg Patnude <> wrote:

    [ TOFU was lost under the signature ]

    > Its called "slurping" a file -- the so-called "pros" claim it is not really
    > "recommended" but I do it all the time with absolutely no consequence and no
    > obvious performance hit until the files exceed 6 MB or so ...


    Presumably this scaling problem is what they had in mind.

    > if (open (DATA, "$FILENAME")) {
    >
    > @DATA = <DATA>;
    >
    > }


    And anyway, how does this help the OP, who needs the first bit of each
    line to go in one array and the second to go in another?

    Also, you might want to reconsider using the special DATA filehandle
    like this.
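
    Even with the lines slurped into an array, the OP would still need a
    second pass to peel each line apart into the two arrays, something
    along these lines (a sketch that keeps Greg's $FILENAME but avoids
    the special DATA handle):

    open(my $fh, '<', $FILENAME) or die "Cannot open $FILENAME: $!";
    my @lines = <$fh>;              # read every line at once
    close $fh;

    my (@xvalue, @yvalue);
    for my $line (@lines) {
        my ($x, $y) = split ' ', $line;
        push @xvalue, $x;
        push @yvalue, $y;
    }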

    --
    Steve
    Steve Grazzini, Oct 22, 2003
    #6
  7. <> wrote:
    > Mike Deskevich <> wrote:


    >> my main question is: is
    >> there a faster way to read the data than how i'm currently doing it


    >> here's how i read my data files

    >
    >> $ct=0;
    >> while (<DATAFILE>)
    >> {
    >> ($xvalue[$ct],$yvalue[$ct])=split;
    >> $ct++;
    >> }



    >> i
    >> think that most of the time is being used in either perl start up

    ^^^^^
    >> time, or data reading time, the post processing is happening pretty
    >> fast (i think)



    Profile it and then you'll *know* where the slow part is.


    > If you knew in advance the length you could do :
    >
    > $xvalue[$vector_len - 1] = 0.0



    But what if 0.0 is a legal value?

    How will you know that this is the bogus one?

    $#xvalue = $vector_len - 1; # extend @xvalue array with undef values


    > if the lengths of your vectors really are unknowable, then you could
    > at least start pre-allocating chunks in advance and doubling the
    > size each time you 'run out of room',



    Which is roughly what perl is doing for you already...


    > A real guru may have a much better answer



    It seems unlikely to me that it is I/O bound.

    The only way to know is to profile it, until then we're
    spinning our wheels.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Oct 22, 2003
    #7
  8. >> If you knew in advance the length you could do :
    >>
    >> $xvalue[$vector_len - 1] = 0.0



    > But what if 0.0 is a legal value?


    > How will you know that this is the bogus one?


    Well ... I was planning on writing over the 0.0 with valid data.

    It's all irrelevant if Perl does the allocation in a manner which
    guarantees O(log N) allocations, though. So there went that guess.

    >> if the lengths of your vectors really are unknowable, then you could
    >> at least start pre-allocating chunks in advance and doubling the
    >> size each time you 'run out of room',




    > The only way to know is to profile it, until then we're
    > spinning our wheels.


    Agreed. I was just shooting from the hip.

    Mr. Original Poster: if you do actually profile, I would
    be interested to know where the first bottleneck is.

    My money is now on finding an sscanf-ish replacement for the
    split statement. But that's just because I haven't thought of
    anything else.
    --
    I used to think government was a necessary evil.
    I'm not so sure about the necessary part anymore.
    , Oct 23, 2003
    #8
  9. "Greg Patnude" <> wrote in message
    news:ZDylb.124175$...
    > Its called "slurping" a file -- the so-called "pros" claim it is not really
    > "recommended" but I do it all the time with absolutely no consequence and no
    > obvious performance hit until the files exceed 6 MB or so ...
    >
    > if (open (DATA, "$FILENAME")) {
    >
    > @DATA = <DATA>;
    >
    > }
    >
    > you can read more about it in the Perl FAQ -->
    >
    >

    http://www.perldoc.com/perl5.6/pod/perlfaq5.html#How-can-I-read-in-an-entire-file-all-at-once-
    >
    > --
    > Greg Patnude / The Digital Demention
    > 2916 East Upper Hayden Lake Road
    > Hayden Lake, ID 83835
    > (208) 762-0762
    >
    > "Mike Deskevich" <> wrote in message
    > news:...
    > > i have a quick (hopefully) question for the perl gurus out there. i
    > > have a bunch of data files that i need to read in and do some
    > > processing. the data files are simple two columns of (floating point)
    > > numbers, but the size of the file can range from 1000 to 10,000 lines.
    > > i need to save the data in an array for post processing, so i can't
    > > just read a line and throw the data away. my main question is: is
    > > there a faster way to read the data than how i'm currently doing it
    > > (i'm a c programmer, so i'm sure that i'm not using perl as
    > > efficiently as i can)
    > >
    > > here's how i read my data files
    > >
    > > $ct=0;
    > > while (<DATAFILE>)
    > > {
    > > ($xvalue[$ct],$yvalue[$ct])=split;
    > > $ct++;
    > > }
    > > #do stuff with xvalue and yvalue
    > >
    > > is there a more efficient way to read in two columns of numbers? it
    > > turns out that i have a series of these data files to process and i
    > > think that most of the time is being used in either perl start up
    > > time, or data reading time, the post processing is happening pretty
    > > fast (i think)
    > >
    > > thanks,
    > > mike

    >
    >


    Not intending to be picky, but "slurping" a file usually means reading the
    entire file into a single scalar variable as a string. This is done by
    setting $/ to undef before the read.
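
    For reference, that scalar-slurp idiom looks roughly like this ($fn
    standing in for whatever filename variable you use; the local keeps
    the change to $/ from leaking out of the block):

    my $contents;
    {
        local $/;                   # undef $/: no input record separator
        open(my $fh, '<', $fn) or die "Cannot open $fn: $!";
        $contents = <$fh>;          # the whole file arrives in one read
        close $fh;
    }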
    Joe Minicozzi, Oct 23, 2003
    #9
  10. Tore Aursand

    On Wed, 22 Oct 2003 17:16:53 +0000, Greg Patnude wrote:
    > $#ARRY will give you the number of array elements also ...


    No. $#ARRAY will give you the highest index in @ARRAY, while @ARRAY in
    scalar context gives you the number of elements.
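
    For example:

    my @a = (1.5, 2.5, 3.5);
    print $#a, "\n";          # 2 (highest index)
    print scalar(@a), "\n";   # 3 (number of elements)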


    --
    Tore Aursand <>
    Tore Aursand, Oct 23, 2003
    #10
  11. Tore Aursand

    On Wed, 22 Oct 2003 09:26:31 -0700, Mike Deskevich wrote:
    > $ct=0;
    > while (<DATAFILE>)
    > {
    > ($xvalue[$ct],$yvalue[$ct])=split;
    > $ct++;
    > }
    > #do stuff with xvalue and yvalue


    I created a file with 100,000 lines of tab-delimited floating point
    numbers, and it took my computer 1.5 seconds to add the two columns to two
    different arrays. How fast do you want it to be?


    --
    Tore Aursand <>
    Tore Aursand, Oct 23, 2003
    #11
  12. wrote in message news:<bn7k1k$frh$>...
    > >> If you knew in advance the length you could do :
    > >>
    > >> $xvalue[$vector_len - 1] = 0.0

    >
    >
    > > But what if 0.0 is a legal value?

    >
    > > How will you know that this is the bogus one?

    >
    > well ..... was planning on writing over the 0.0 with valid data.
    >
    > All irrelevant if perl does the allocation in a manner which
    > garuantees log(N) allocs though. so there went that guess.
    >
    > >> if the lengths of your vectors really are unknowable, then you could
    > >> at least start pre-allocating chunks in advance and doubling the
    > >> size each time you 'run out of room',

    >
    >
    >
    > > The only way to know is to profile it, until then we're
    > > spinning our wheels.

    >
    > agreed. was just shooting from the hip.
    >
    > Mr. Orginal poster: if you do actually profile, I would
    > be interested to know where the first bottleneck is.
    >
    > my money is now on finding a sscanf-ish replacement for the
    > split statement. But thats just because I haven't thought of
    > anything else.



    Yes, I agree profiling is the best way to find the bottleneck. I'm
    new to Perl and don't know all the internals yet. Are there built-in
    functions to help with profiling?

    Thanks!
    Mike
    Mike Deskevich, Oct 23, 2003
    #12
  13. Mike Deskevich <> wrote:

    > are there built
    > in functions to help in profiling?



    perldoc -q profile
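
    For what it's worth, the usual tools here are Devel::DProf for
    whole-script profiling (run "perl -d:DProf script.pl", then "dprofpp"
    on the resulting tmon.out) and the core Benchmark module for comparing
    two candidate approaches. A rough Benchmark sketch pitting split
    against a regex on one sample line:

    use Benchmark qw(timethese);

    my $line = "1.234 5.678\n";

    timethese(500_000, {
        split => sub { my ($x, $y) = split ' ', $line },
        regex => sub { my ($x, $y) = $line =~ /^\s*(\S+)\s+(\S+)/ },
    });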


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Oct 23, 2003
    #13