Efficient field splitting? unpack or substr

Discussion in 'Perl Misc' started by ifiaz, Oct 10, 2003.

  1. ifiaz

    ifiaz Guest

    I have data that looks like this, on a single line.

    "01 17060757 EG 6880232 N 0131020321 17 060712 l 8828 TR6322
    00030070 01 20030317060807749544 060645 244 PA1"

    There are about 280,000 such lines.

    The fields are fixed-width. You can't extract them using delimiters,
    as some of the fields may be blank.

    I originally wrote an awk script that used substr to extract the
    fields from $0, and it took 25.26 seconds to calculate the summary.

    Field Splitting in awk, for your info
    F1 =substr($0, 1, 2)
    TiltTime =substr($0, 4, 8)
    ....
    ....

    Using an awk-to-perl converter, the same thing in perl took only
    11.03 seconds (the converted script used substr as well).

    Field Splitting in awk to perl, for your info
    $F1 = substr($_, 1, 2);
    $TiltTime = substr($_, 4, 8);
    ....
    ....
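
For reference, awk's substr() counts from 1 while Perl's counts from 0. Scripts produced by the a2p converter normally set `$[ = 1` near the top, which is presumably why the converted offsets above match the awk ones (an assumption, since the full converted script is not shown). In hand-written Perl the same two fields would use zero-based offsets. A minimal sketch, using only the start of the sample record:

```perl
use strict;
use warnings;

# First part of the sample record from the post (remaining fields omitted).
my $line = '01 17060757 EG 6880232';

# awk:  substr($0, 1, 2) and substr($0, 4, 8) -- start positions count from 1.
# Perl: start positions count from 0, so subtract one from each awk offset.
my $F1       = substr($line, 0, 2);
my $TiltTime = substr($line, 3, 8);

print "$F1|$TiltTime\n";
```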

    Then I wrote a perl script in which only the field-splitting part
    was replaced with unpack. That script takes 21.5 seconds.

    Field Splitting in perl using unpack, for your info

    ($F1, $TiltTime, ...) =
    unpack("a2xa8xa2xa3a5xa1xa10xa2xa6xa1xa4xa8xa6xa8xa2xa20xa6xa3xa3",
    $_);
    ....
    ....

    Why is unpack not efficient? Am I doing anything wrong?
    Should I stick to substr to do such field splitting in the future?
    Can I write it any other way to make it more efficient.


    - Fiaz Idris
     
    ifiaz, Oct 10, 2003
    #1

  2. On 9 Oct 2003 21:14:40 -0700
    (ifiaz) wrote:
    <snip>
    > Why is unpack not efficient? Am I doing anything wrong?
    > Should I stick to substr to do such field splitting in the future?
    > Can I write it any other way to make it more efficient.


    It does not appear that you're doing anything wrong. 'unpack' will
    look at the whole line and, well, unpack it :) With 'substr', you're
    telling the script _exactly_ where to look, so it's not looking at
    the whole line.

    The question you need to ask yourself is this - do I _need_ to examine
    the whole line, or just extract the required data from the line? Use
    substr for just pieces of the line, unpack for the whole line.
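
A minimal sketch of that distinction, assuming only the first three fields of the sample record (the variable names are illustrative, not from the original script). It also shows that unpack can jump to an absolute offset with `@`, so a template does not have to describe every field before the one you want:

```perl
use strict;
use warnings;

# Start of the sample record; the remaining fields are omitted.
my $line = '01 17060757 EG 6880232';

# substr: go straight to one field (offset 3, length 8).
my $tilt_substr = substr($line, 3, 8);

# unpack with '@3': move to absolute offset 3, then take 8 bytes.
my ($tilt_unpack) = unpack('@3 a8', $line);

# Whole-record style: list every field, with 'x' skipping one separator byte.
my ($f1, $tilt, $code) = unpack('a2 x a8 x a2', $line);

print "$tilt_substr $tilt_unpack $f1 $tilt $code\n";
```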

    HTH

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    Never hit a man with glasses. Hit him with a baseball bat.
     
    James Willmore, Oct 10, 2003
    #2

  3. ifiaz wrote:
    > Why is unpack not efficient?


    Remember that unpack() has to parse the template every time
    through the loop...

    > Should I stick to substr to do such field splitting in the
    > future?


    That's up to you. I'll just mention that the unpack() version
    can be much, much easier to read.

    @fields = qw(one two three ...);
    $template = qq(a4 a12 a3 ...);

    while (<>) {
    @data{ @fields } = unpack $template, $_;
    }
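
A runnable version of the snippet above might look like the following. The field names, the widths, and the second record are made up for illustration; only the first record's leading fields come from the original post.

```perl
use strict;
use warnings;

# Hypothetical names and widths covering just the first three fields.
my @fields   = qw(F1 TiltTime Code);
my $template = 'a2 x a8 x a2';    # 'x' skips the one-byte separator

# First record is the start of the posted sample; the second is invented.
my @records = ('01 17060757 EG 6880232', '02 17060801 EG 6880233');

my %data;
for my $line (@records) {
    # Hash slice: assign all named fields in one statement.
    @data{@fields} = unpack $template, $line;
    print join(' ', map { "$_=$data{$_}" } @fields), "\n";
}
```

Keeping the names and the template side by side makes it easy to see which width belongs to which field, which is the readability point being made.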

    --
    Steve
     
    Steve Grazzini, Oct 10, 2003
    #3
  4. ifiaz wrote:
    > I have a data that looks like this in a single line.
    >
    > "01 17060757 EG 6880232 N 0131020321 17 060712 l 8828 TR6322
    > 00030070 01 20030317060807749544 060645 244 PA1"
    >

    . . .
    > Using awk to perl converter, the same thing in perl took only 11.03
    > seconds.
    > (awk to perl used substr as well)
    >
    > Field Splitting in awk to perl, for your info
    > $F1 = substr($_, 1, 2);
    > $TiltTime = substr($_, 4, 8);
    > ....
    > ....
    >
    > Now, I wrote a perl script, but only replaced the field splitting part
    > with
    > unpack. Now, the script takes 21.5 seconds.
    >
    > Field Splitting in perl using unpack, for your info
    >
    > ($F1, $TiltTime, ...) =
    > unpack("a2xa8xa2xa3a5xa1xa10xa2xa6xa1xa4xa8xa6xa8xa2xa20xa6xa3xa3",
    > $_);
    > ....
    > ....
    >
    > Why is unpack not efficient? Am I doing anything wrong?
    > Should I stick to substr to do such field splitting in the future?
    > Can I write it any other way to make it more efficient.
    >
    >
    > - Fiaz Idris


    I have found that unpack is significantly slower as well. I can't say
    conclusively why, but my guess is that it's built to do much more than
    just extract certain characters from a string the way you appear to be
    using it.

    Believe it or not, a regex is very fast at this sort of thing if
    performance is a major concern.

    my $string = 'one two three four';
    my ($o,$tw,$th,$f) = $string =~ /^(...).(...).(.....).(....)/;
    # or /^(.{3}).(.{3}).(.{5}).(.{4})/

    Benchmark this against substr with your data, and I think you'll find
    that this is much faster. In past cases where I've looked to do
    something similar, the regex has won, except in cases where I've
    needed only a small portion of the large string.
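
One way to run such a comparison is the core Benchmark module. The sketch below is only an assumption about how the test could be set up: it splits three fields from a short in-memory string, whereas the real job reads about 280,000 longer lines from a file, so the relative timings may differ and will vary with the perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Start of the sample record; only the first three fields are split here.
my $line = '01 17060757 EG 6880232';

my %split = (
    substr => sub {
        my @f = (substr($line, 0, 2), substr($line, 3, 8), substr($line, 12, 2));
        return @f;
    },
    unpack => sub {
        my @f = unpack('a2 x a8 x a2', $line);
        return @f;
    },
    regex => sub {
        my @f = $line =~ /^(.{2}).(.{8}).(.{2})/;
        return @f;
    },
);

# Record what each approach returns so the timings compare like with like.
my %result;
for my $name (keys %split) {
    $result{$name} = join '|', $split{$name}->();
}

# Run each sub 200,000 times and print a comparison table.
cmpthese(200_000, \%split);
```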
     
    Christopher Hamel, Oct 10, 2003
    #4
  5. Keith Keller

    Keith Keller Guest


    On 2003-10-10, ifiaz wrote:
    >
    > The fields are fixed-widths. You can't extract it using delimiters as
    > some of the
    > fields may be blank.


    If your delimiters are spaces, sure. If you are able to generate the
    file using a different delimiter (tab is a common one) then maybe
    splitting on the delimiter will be easier. Your only concern would be
    to pick a character that you know for certain doesn't appear in any of
    your data fields.
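
If the file could be regenerated with tabs, the splitting might look like this minimal sketch. The record here is hypothetical (not the poster's data); it has a blank third field and a blank last field to show that empty fields survive.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited record: third and last fields are blank.
my $line = "01\t17060757\t\t6880232\t";

# A limit of -1 keeps trailing empty fields, which split drops by default.
my @fields = split /\t/, $line, -1;

print scalar(@fields), " fields\n";
```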

    --keith

    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom

     
    Keith Keller, Oct 10, 2003
    #5
  6. ifiaz

    ifiaz Guest

    Re: Efficient field splitting? unpack or substr or regex

    > I have found that unpack is significantly slower as well. I can't say
    > conclusively why, but my guess is that it's built to do much more than
    > just extract certain characters from a string the way you appear to be
    > using it.
    >
    > Believe it or not, a regex is very fast at this sort of thing if
    > performance is a major concern.
    >
    > my $string = 'one two three four';
    > my ($o,$tw,$th,$f) = $string =~ /^(...).(...).(.....).(....)/;
    > # or /^(.{3}).(.{3}).(.{5}).(.{4})/
    >
    > Benchmark this against substr with your data, and I think you'll find
    > that this is much faster. In past cases where I've looked to do
    > something similar, the regex has won, except in cases where I've
    > needed only a small portion of the large string.


    I did try the regex as you suggested, but in fact it is slower than
    substr.

    I don't remember the exact time, but it was about 21 seconds
    (certainly more than 20). Since I am at home for the weekend, I
    can't check the exact figure right now.

    Thanks to all of you. If you have any further input on this
    you are most certainly welcome.
     
    ifiaz, Oct 10, 2003
    #6
  7. ifiaz

    ifiaz Guest

    > I have found that unpack is significantly slower as well. I can't say
    > conclusively why, but my guess is that it's built to do much more than
    > just extract certain characters from a string the way you appear to be
    > using it.
    >
    > Believe it or not, a regex is very fast at this sort of thing if
    > performance is a major concern.
    >
    > my $string = 'one two three four';
    > my ($o,$tw,$th,$f) = $string =~ /^(...).(...).(.....).(....)/;
    > # or /^(.{3}).(.{3}).(.{5}).(.{4})/
    >
    > Benchmark this against substr with your data, and I think you'll find
    > that this is much faster. In past cases where I've looked to do
    > something similar, the regex has won, except in cases where I've
    > needed only a small portion of the large string.


    Sorry for posting the reply in a completely new thread with the subject
    "Re: Efficient field splitting? unpack or substr or regex".
    That was an error, so I am repeating the reply below.

    I did try the regex as you suggested, but in fact it is slower than
    substr. It took 23.49 seconds.

    Field Splitting in perl using regex, for your info
    ($F1, $TiltTime, ....) = $_ =~ /(.{2}) (.{8}) (.{2}) ..../;

    Thanks to all of you. If you have any further input on this,
    you are most certainly welcome.

    And for you, Keith: the delimiters are spaces, and that can't be
    changed, at least for now.
     
    ifiaz, Oct 13, 2003
    #7
