Very slow

Discussion in 'Perl Misc' started by George Mpouras, Jan 12, 2012.

  1. Create a test file with 20000000 identical lines of 50 commas each (it
    will be 1020000000 bytes):
    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

    Now I have a Perl and a C program running almost the same code. Perl needs
    about 6 minutes to finish, while the C version finishes in 20 seconds.
    The difference is huge. What can I do for a faster Perl version?
    The two programs follow.


    --== C ==--


    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        char *line = NULL;
        char *p = NULL;
        int NC = 1;
        int n = 0;
        size_t len = 0;
        ssize_t read;

        while (getline(&line, &len, stdin) != -1) {
            n = 1;
            for (p = line; *p != '\0'; p++) {
                if (*p == ',') { n++; }
                else if (*p == '\\') { p++; }
            }
            if (n > NC) {
                NC = n;
            }
        }
        if (line) free(line);
        printf("%d\n", NC);
        return EXIT_SUCCESS;
    }


    --== Perl ==--



    #!/usr/bin/perl

    my $NC = 1;
    open DATA, '<', '/work/test.txt' or die "$^E\n";

    while (<DATA>)
    {
        my ($i, $n, $Length) = (0, 1, length $_);

        while ( $i++ < $Length )
        {
            my $c = substr $_, $i, 1;
            if ( $c eq ',' ) { $n++ } elsif ($c eq '\\') {$i++}
        }

        $NC = $n if $n > $NC
    }

    close DATA;
    print "Fields $NC\n"
    George Mpouras, Jan 12, 2012
    #1

  2. Update: with the following variation the execution time is 20 seconds
    instead of 7 minutes.
    I cannot understand why, but it works!


    #!/usr/bin/perl
    my ($NC, $i, $n, $Length) = (1,0,1,0);
    open DATA, '<', '/work/data.txt' or die "$^E\n";
    while (<DATA>)
    {
        ($n, $Length) = (1, length $_);

        do {
            substr($_, $i, 1) eq ',' ? ( $n++ )
                                     : ( substr($_, $i, 1) eq '\\' ? ($i++) : () )
        } until ($i++ > $Length);

        $NC = $n if $n > $NC
    }

    close DATA;
    print "Fields $NC\n";
    George Mpouras, Jan 12, 2012
    #2

  3. Ed (Guest)

    On Jan 12, 9:31 am, "George Mpouras"
    <> wrote:
    > Update . With the following variation the execution time is 20 seconds
    > instead of 7 minutes.
    > I can not undestanf why, but it works !


    Perhaps because your adjustment means the code works down the string
    from index 0 to length-1, instead of 1 to length (which is what the
    original code did). I'm guessing there's special handling in place
    for the substr call with the index set to the string length, and that
    the special handling accounts for the performance difference.

    The original code post-increments at the start of the loop:

    > while ( $i++ < $Length )
    > {
    > my $c = substr $_, $i, 1;
    > if ( $c eq ',' ) { $n++ } elsif ($c eq '\\') {$i++}
    > }


    The modified code post-increments at the end of the loop:

    >  do
    >  {
    >  substr($_,$i, 1) eq ',' ? ( $n++ ) : ( substr($_,$i, 1) eq '\\' ?
    > ($i++):() ) }
    >  until ($i++ > $Length);
    Ed, Jan 12, 2012
    #3
  4. Thanks Ben,

    What the code actually does is count the fields while respecting
    backslash-escaped commas:
    a 2k+1 number of backslashes (\ \\\ \\\\\ ...) indicates a non-separator;
    a 2k number of backslashes (\\ \\\\ ...) leaves a valid comma separator.
    So your code does not always give correct results.
    Have a look at the update I posted. To my surprise, the constructs

    while ( ... ) { ... } or
    until ( ... ) { ... } or
    for ( ... ) { ... }

    are killing Perl's speed. I cannot explain why, but the approach of

    do
    {
    }
    while ( ... )

    is much faster. The time decreased from 7 minutes to 20 seconds (the same
    as your speed). I am not very happy with that, because my C colleague is
    laughing because Perl cannot "for".
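
    Going back to the escaping rule: here is a minimal sketch of it on a few
    sample strings (chosen for illustration only, not part of the test data),
    using the same skip-the-escaped-character idea as the per-character loops
    in this thread, with the index running from 0:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub count_fields {
        my ($line) = @_;
        my ($i, $n, $len) = (0, 1, length $line);
        while ($i < $len) {
            my $c = substr $line, $i, 1;
            if    ($c eq ',')  { $n++ }
            elsif ($c eq '\\') { $i++ }   # skip the escaped character
            $i++;
        }
        return $n;
    }

    # expected: , -> 2, \, (odd) -> 1, \\, (even) -> 2, \\\, (odd) -> 1
    printf "%-10s -> %d field(s)\n", $_, count_fields($_)
        for "a,b", "a\\,b", "a\\\\,b", "a\\\\\\,b";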

    On 12/1/2012 4:21 PM, Ben Morrow wrote:
    >
    > Quoth "George Mpouras"<>:
    >> Create a test file with 20000000 same lines of 50 commas (it will be
    >> 1020000000 bytes)
    >> ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    >>
    >> Now I have a perl and C program running almost the same code. Perl needs
    >> about 6 minutes to finish while the C version finishes at 20 seconds.
    >> The difference is huge. What can I do for a faster perl version
    >> Following the two programs
    >>
    >> --== C ==--

    > <snip>
    >> while (getline(&line,&len, stdin) != -1) {
    >> n = 1;

    > ^^^
    > 0, otherwise you get an off-by-one error
    >
    > <snip>
    >> #!/usr/bin/perl
    >>
    >> my $NC = 1;
    >> open DATA, '<', '/work/test.txt' or die "$^E\n";
    >>
    >> while (<DATA>)
    >> {
    >> my ($i, $n, $Length) = (0, 1, length $_);
    >>
    >> while ( $i++ < $Length )
    >> {
    >> my $c = substr $_, $i, 1;
    >> if ( $c eq ',' ) { $n++ } elsif ($c eq '\\') {$i++}
    >> }
    >>
    >> $NC = $n if $n > $NC
    >> }
    >>
    >> close DATA;
    >> print "Fields $NC\n"

    >
    > Well, try not writing C in Perl :).
    >
    > ~% time ./ccommas <commas
    > 50
    > 0m3.29s real 0m3.08s user 0m0.20s system
    > ~% time perl plcommas
    > Fields 50
    > 6m37.54s real 6m35.89s user 0m1.61s system
    > ~% time perl -nE'
    > BEGIN { $NC = 1 }
    > s/\\,//g; $c = tr/,//;
    > $c > $NC and $NC = $c;
    > END { say "Fields $NC" }'
    > Fields 50
    > 0m16.74s real 0m15.48s user 0m1.25s system
    > ~%
    >
    > So, not quite as fast as the C version, but not too bad.
    >
    > Ben
    >
    George Mpouras, Jan 12, 2012
    #4
  5. Ed,
    The magick is only at the "do" statement
    George Mpouras, Jan 12, 2012
    #5
  6. By the way, here is some quick code for creating the sample data:

    open FILE, '>', 'data.txt' or die;
    for (1..20_000_000) { print FILE (','x50),"\n" }
    close FILE;
    George Mpouras, Jan 12, 2012
    #6
  7. Ed (Guest)

    On Jan 12, 3:10 pm, George Mpouras
    <> wrote:
    > Ed,
    > The magick is only at the "do" statement


    I wouldn't so much say that it's magick - the adjustment fixes an off
    by 1 bug. And it's not just a performance issue; the original code
    gives the wrong answer for lines that have a comma at index 0.

    To be very clear, for an input line of ",,", the original code does:

    substr($_, 1, 1)
    substr($_, 2, 1)

    This looks at the second comma and an empty string (when substr is
    called with 2, the index set to the length of the input string). My
    guess is that the second call there is handled as a special case in
    the perl code and that triggers the performance increase. The code
    will report the number of fields as 2, and the answer ought to be 3.

    The modified code does:

    substr($_, 0, 1)
    substr($_, 1, 1)

    The modified code looks at both commas, and properly reports the
    number of fields as 3.
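
    A tiny script makes the two index ranges visible (added for illustration
    only; the expected output is in the comments):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line = ",,";

    # indices the original loop visits (1 .. length): the second comma, then
    # the empty string substr returns at offset == length
    print "orig:  [", substr($line, $_, 1), "]\n" for 1 .. length $line;

    # indices the modified loop visits (0 .. length - 1): both commas
    print "fixed: [", substr($line, $_, 1), "]\n" for 0 .. length($line) - 1;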
    Ed, Jan 12, 2012
    #7
  8. This is not the case. The code is correct for both the slow and the fast
    versions. Create a one-line file and test it.
    Also, an odd number of \ escapes the separator,
    while an even number of \ does not.
    #8
  9. On 2012-01-12 20:07, George Mpouras <> wrote:
    > Thanks Ben,
    >
    > What the code actually is doing is to count the fields while respecting
    > the number of \ escaped commas.
    > A 2k+1 number of slashes \ \\\ \\\\\ ... indicating a non separator
    > A 2k number of slashed \\ \\\\ ... is a valid comma separator
    > So your code does not give always correct results.


    It should be s/\\.//g; instead of s/\\,//g;
    Simple error (maybe a typo).
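
    For example (a sample line for illustration, not from the thread), take a
    line where the first comma follows two backslashes (a real separator) and
    the second follows one backslash (escaped), so the correct count is 2:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line = "a\\\\,b\\,c";      # literally: a\\,b\,c

    my $typo = $line;
    $typo =~ s/\\,//g;             # also eats the real separator
    my $fix = $line;
    $fix =~ s/\\.//g;              # consumes backslash pairs left to right

    print "s/\\\\,//g -> ", ($typo =~ tr/,//) + 1, " field(s)\n";   # 1 (wrong)
    print "s/\\\\.//g -> ", ($fix  =~ tr/,//) + 1, " field(s)\n";   # 2 (right)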

    > Have a look at the update I post. At my suprise the codes
    >
    > while ( ... ) { ... } or
    > until ( ... ) { ... } or
    > for ( ... ) { ... }
    >
    > are killing Perl speed. I can not explaing why but the aproach of
    >
    > do
    > {
    > }
    > while ( ... )
    >
    > is much faster. Time from 7 minutes decreased to 20 seconds (the same as
    > your speed). I am not very happy with that, because my C collegue is
    > loughing because perl cannot "for"


    I don't see that.

    I have 4 test programs:


    #!/usr/bin/perl
    # mpouras2 - your original code, with the off-by-one error fixed and
    # reformatted by perltidy

    my $NC = 1;
    open DATA, '<', 'test.txt' or die "$^E\n";

    while (<DATA>) {
        my ( $i, $n, $Length ) = ( 0, 1, length $_ );

        while ( $i < $Length ) {
            my $c = substr $_, $i, 1;
            if ( $c eq ',' ) {
                $n++;
            }
            elsif ( $c eq '\\' ) {
                $i++
            }
            $i++;
        }

        $NC = $n if $n > $NC;
    }

    close DATA;
    print "Fields $NC\n"
    __END__


    #!/usr/bin/perl
    # mpouras-do - while {} loop replaced with do {} while loop. No other
    # change.

    my $NC = 1;
    open DATA, '<', 'test.txt' or die "$^E\n";

    while (<DATA>) {
        my ( $i, $n, $Length ) = ( 0, 1, length $_ );

        do {
            my $c = substr $_, $i, 1;
            if ( $c eq ',' ) {
                $n++;
            }
            elsif ( $c eq '\\' ) {
                $i++
            }
            $i++;
        } while ($i < $Length);

        $NC = $n if $n > $NC;
    }

    close DATA;
    print "Fields $NC\n"
    __END__


    #!/usr/bin/perl
    # mpouras-do-ed - Ed's version with do loop and ternary op instead of if

    my $NC = 1;
    open DATA, '<', 'test.txt' or die "$^E\n";

    while (<DATA>) {
        my ( $i, $n, $Length ) = ( 0, 1, length $_ );

        do {
            substr($_, $i, 1) eq ','  ? ( $n++ )
          : substr($_, $i, 1) eq '\\' ? ( $i++ )
          :                             ();
        } while (++$i < $Length);

        $NC = $n if $n > $NC;
    }

    close DATA;
    print "Fields $NC\n"
    __END__


    #!/usr/bin/perl
    # mpouras-ben - Ben's version with the error fixed.

    my $NC = 1;
    open DATA, '<', 'test.txt' or die "$^E\n";

    while (<DATA>) {
        my ( $i, $n, $Length ) = ( 0, 1, length $_ );

        s/\\.//g;
        $n = tr/,// + 1;

        $NC = $n if $n > $NC;
    }

    close DATA;
    print "Fields $NC\n"
    __END__

    The run times (user time in seconds on a 1.86 GHz Core2, median of 5
    runs, 2 million lines):

    mpouras2: 57.76
    mpouras-do: 65.90
    mpouras-do-ed: 34.61
    mpouras-ben: 2.43

    So the do loop is actually a bit slower than the while loop. Using a
    ternary operator instead of if/else is quite a bit faster.

    Of course, using the built-in s/// and tr/// operators is much faster than
    processing the string one character at a time, no matter how much you
    optimize it.
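
    A rough way to reproduce this ranking in a single script is the core
    Benchmark module; the sketch below (one 50-comma line, labels of my own
    choosing) is only an illustration, not the timings above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = (',' x 50) . "\n";

    cmpthese(-3, {
        substr_loop => sub {
            my ($i, $n, $len) = (0, 1, length $line);
            while ($i < $len) {
                my $c = substr $line, $i, 1;
                if    ($c eq ',')  { $n++ }
                elsif ($c eq '\\') { $i++ }
                $i++;
            }
            return $n;
        },
        s_and_tr => sub {
            my $copy = $line;
            $copy =~ s/\\.//g;
            return ($copy =~ tr/,//) + 1;
        },
    });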


    hp



    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Jan 12, 2012
    #9
  10. Kaz Kylheku (Guest)

    On 2012-01-12, George Mpouras <> wrote:
    > Create a test file with 20000000 same lines of 50 commas (it will be
    > 1020000000 bytes)
    > ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    >
    > Now I have a perl and C program running almost the same code. Perl needs
    > about 6 minutes to finish while the C version finishes at 20 seconds.
    > The difference is huge. What can I do for a faster perl version
    > Following the two programs
    >
    >
    > --== C ==--
    >
    >
    > #define _GNU_SOURCE
    > #include <stdio.h>
    > #include <stdlib.h>
    >
    > int main(void) {
    > char *line = NULL;
    > char *p = NULL;
    > int NC = 1;
    > int n = 0;
    > size_t len = 0;
    > ssize_t read;


    Unused variable.

    > while (getline(&line, &len, stdin) != -1) {
    > n = 1;
    > for (p = line; *p != '\0'; p++) {
    > if (*p == ',') { n++; }
    > else if (*p == '\\') { p++; }


    The buffer returned by the getline function only includes the newline character
    if one occurs in the data. What if it doesn't?

    Then by dumb luck you could end up with a string like "...," or "...\".

    Then you increment the pointer, and now it points to the null terminator.
    But then the for loop's increment step will increment the pointer again,
    beyond the end of the string. Then the *p != '\0' test is applied to an
    invalid pointer.

    > }
    > if (n > NC) {
    > NC = n;
    > }
    > }
    > if (line) free(line);


    You don't have to test for null because free(NULL) is well-defined behavior.

    This line is a waste of cycles anyway if you know that the OS cleans up the
    virtual memory in one fell swoop. You're just twiddling around some pointers
    inside malloc's heap, when a moment later, that entire heap is going to be
    history.

    Releasing all memory is good in a debug build of a program, just to show
    that you /can/ do it, and to help with hunting down leaks.

    > printf("%d\n", NC);
    > return EXIT_SUCCESS;
    > }


    There is no need to dynamically allocate buffers to hold an entire
    line. Try this one:

    #include <stdio.h>

    /* Look Ma, no malloc, no getline, no _GNU_SOURCE.
    Not even a single pointer declaration. */

    int main(void)
    {
        int NC = 1;
        int n = 1;
        enum state { init, esc, done } state = init;

        while (state != done) {
            int ch = getchar();
            switch (ch) {
            case '\n': case EOF:
                if (n > NC)
                    NC = n;
                n = 1;
                state = (ch == EOF) ? done : init;
                break;
            case '\\':
                switch (state) {
                case init: state = esc; break;
                case esc:  state = init; break;
                }
                break;
            case ',':
                if (state == init)
                    n++;
                /* fallthrough */
            default:
                state = init;
                break;
            }
        }
        printf("%d\n", NC);
        return 0;
    }
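
    For comparison, here is a rough Perl transcription of the same state
    machine (an illustration, not code posted in the thread); the timings in
    this thread suggest a per-character loop like this will still be far
    slower than the s/// + tr/// approach:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $NC  = 1;
    my $esc = 0;                      # inside a backslash escape?

    while (my $line = <STDIN>) {
        chomp $line;
        my $n = 1;
        for my $ch (split //, $line) {
            if    ($esc)        { $esc = 0 }   # character after a backslash
            elsif ($ch eq '\\') { $esc = 1 }
            elsif ($ch eq ',')  { $n++ }
        }
        $esc = 0;                     # an escape does not span lines here
        $NC = $n if $n > $NC;
    }
    print "$NC\n";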
    Kaz Kylheku, Jan 12, 2012
    #10
  11. Kaz Kylheku (Guest)

    On 2012-01-12, Ben Morrow <> wrote:
    >
    > Quoth "George Mpouras" <>:
    >> Create a test file with 20000000 same lines of 50 commas (it will be
    >> 1020000000 bytes)
    >> ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    >>
    >> Now I have a perl and C program running almost the same code. Perl needs
    >> about 6 minutes to finish while the C version finishes at 20 seconds.
    >> The difference is huge. What can I do for a faster perl version
    >> Following the two programs
    >>
    >> --== C ==--

    ><snip>
    >> while (getline(&line, &len, stdin) != -1) {
    >> n = 1;

    > ^^^
    > 0, otherwise you get an off-by-one error


    That is your opinion of what the requirement specification should be. This
    program seems to be determining the number of columns in a comma-separated
    file (in which commas can be escaped by backslashes).

    The initialization n = 1 asserts that the number of columns in an empty
    line, or in a line which contains no (unescaped) commas is 1.

    If a line contains characters, but no comma, then it clearly has one
    column.

    Whether an empty line counts as a column is debatable: it's a matter
    that is settled in the requirement specification for the program.

    Then you can discuss whether there exists an off-by-one bug.
    Kaz Kylheku, Jan 12, 2012
    #11
  12. On 13/1/2012 12:05 AM, Kaz Kylheku wrote:
    > On 2012-01-12, George Mpouras<> wrote:
    > <snip>
    >
    > There is no need to dynamically allocate buffers to hold an entire
    > line. Try this one:
    >
    > <snip>

    thanks Kaz
    George Mpouras, Jan 12, 2012
    #12
  13. Hello Peter,

    Unfortunately I had made a mistake: in the counting loop I had forgotten
    to set $i=0. When this was corrected, the time went back to 7 minutes.
    So the fastest solution until now is the following. Unfortunately it is C
    ...


    #!/usr/bin/perl
    # apt-get install libinline-perl

    use Inline C;

    my ($n, $NC) = (undef, 1);
    open DATA, '<', 'data.txt' or die "$^E\n";
    while (<DATA>) { $n = nc($_); $n > $NC and $NC = $n }
    close DATA;
    print "Fields $NC\n";

    __END__
    __C__
    int nc (char* str)
    {
        int i = 0;
        int n = 1;
        char c;
        while (c = str[i++]) {
            if (c == ',') n++;
            else if (c == '\\') i++;
        }
        return n;
    }
    George Mpouras, Jan 13, 2012
    #13
  14. Justin C (Guest)

    On 2012-01-12, George Mpouras <> wrote:
    >
    > thanks Kaz


    Please trim quoted text to the relevant portion only.

    Justin.

    --
    Justin C, by the sea.
    Justin C, Jan 13, 2012
    #14
  15. Dear Peter,

    You are the man! Your version

    s/\\.//g;
    $n = tr/,// + 1;

    is as fast as Inline C. Thanks.
    George Mpouras, Jan 13, 2012
    #15
  16. Justin C (Guest)

    On 2012-01-13, George Mpouras <> wrote:
    > Hello Peter,
    >
    > Unfortunately I hade a mistake , at cont loop , I had forget to set $i=0 .
    > When this corrected the time went againt to 7 minutes.
    > So the fastest solytion until now is the following. Unfortunately it is C
    > ....


    What's unfortunate about that? You use the best/correct tool for the
    job; I don't dig my garden with a plough, and I wouldn't plough a field
    with a spade.

    Justin.

    --
    Justin C, by the sea.
    Justin C, Jan 13, 2012
    #16
  17. On 2012-01-12 21:26, George Mpouras <> wrote:
    > this is not the case. The is code is correct for both slow and fast
    > versions.


    Nope. It ignores a , in the first column. So for your test data it
    reports 50 columns instead of 51.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Jan 13, 2012
    #17
