Very slow

G

George Mpouras

Create a test file with 20000000 same lines of 50 commas (it will be
1020000000 bytes)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Now I have a perl and C program running almost the same code. Perl needs
about 6 minutes to finish while the C version finishes at 20 seconds.
The difference is huge. What can I do for a faster perl version
Following the two programs


--== C ==--


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void) {
char *line = NULL;
char *p = NULL;
int NC = 1;
int n = 0;
size_t len = 0;
ssize_t read;
while (getline(&line, &len, stdin) != -1) {
n = 1;
for (p = line; *p != '\0'; p++) {
if (*p == ',') { n++; }
else if (*p == '\\') { p++; }
}
if (n > NC) {
NC = n;
}
}
if (line) free(line);
printf("%d\n", NC);
return EXIT_SUCCESS;
}


--== Perl ==--



#!/usr/bin/perl

my $NC = 1;
open DATA, '<', '/work/test.txt' or die "$^E\n";

while (<DATA>)
{
my ($i, $n, $Length) = (0, 1, length $_);

while ( $i++ < $Length )
{
my $c = substr $_, $i, 1;
if ( $c eq ',' ) { $n++ } elsif ($c eq '\\') {$i++}
}

$NC = $n if $n > $NC
}

close DATA;
print "Fields $NC\n"
 
G

George Mpouras

Update . With the following variation the execution time is 20 seconds
instead of 7 minutes.
I can not undestanf why, but it works !


#!/usr/bin/perl
my ($NC, $i, $n, $Length) = (1,0,1,0);
open DATA, '<', '/work/data.txt' or die "$^E\n";
while (<DATA>)
{
($n,$Length)=(1,length $_);

do
{
substr($_,$i, 1) eq ',' ? ( $n++ ) : ( substr($_,$i, 1) eq '\\' ?
($i++):() ) }
until ($i++ > $Length);

$NC = $n if $n > $NC
}

close DATA;
print "Fields $NC\n";
 
E

Ed

Update . With the following variation the execution time is 20 seconds
instead of 7 minutes.
I can not undestanf why, but it works !

Perhaps because your adjustment means the code works down the string
from index 0 to length-1, instead of 1 to length (which is what the
original code did). I'm guessing there's special handling in place
for the substr call with the index set to the string length, and that
the special handling accounts for the performance difference.

original code post-increments at the start of the loop:
while ( $i++ < $Length )
{
my $c = substr $_, $i, 1;
if ( $c eq ',' ) { $n++ } elsif ($c eq '\\') {$i++}
}

modified code post-increments at the end of the loop:
 
G

George Mpouras

Thanks Ben,

What the code actually is doing is to count the fields while respecting
the number of \ escaped commas.
A 2k+1 number of slashes \ \\\ \\\\\ ... indicating a non separator
A 2k number of slashed \\ \\\\ ... is a valid comma separator
So your code does not give always correct results.
Have a look at the update I post. At my suprise the codes

while ( ... ) { ... } or
until ( ... ) { ... } or
for ( ... ) { ... }

are killing Perl speed. I can not explaing why but the aproach of

do
{
}
while ( ... )

is much faster. Time from 7 minutes decreased to 20 seconds (the same as
your speed). I am not very happy with that, because my C collegue is
loughing because perl cannot "for"














Στις 12/1/2012 4:21 μμ, ο/η Ben Morrow έγÏαψε:
 
G

George Mpouras

by the way here is a quick code, for creating the sample data

open FILE, '>', 'data.txt' or die;
for (1..20_000_000) { print FILE (','x50),"\n" }
close FILE;
 
E

Ed

Ed,
The magick is only at the "do" statement

I wouldn't so much say that it's magick - the adjustment fixes an off
by 1 bug. And it's not just a performance issue; the original code
gives the wrong answer for lines that have a comma at index 0.

To be very clear, for an input line of ",,", the original code does:

substr($_, 1, 1)
substr($_, 2, 1)

This looks at the second comma and an empty string (when substr is
called with 2, the index set to the length of the input string). My
guess is that the second call there is handled as a special case in
the perl code and that triggers the performance increase. The code
will report the number of fields as 2, and the answer ought to be 3.

The modified code does:

substr($_, 0, 1)
substr($_, 1, 1)

The modified code looks at both commas, and properly reports the
number of fields as 3.
 
G

George Mpouras

this is not the case. The is code is correct for both slow and fast
versions. Create only one line file and test it.
Also an odd number of \ are escaping the separator
while an even number of \ are not escaping
 
P

Peter J. Holzer

Thanks Ben,

What the code actually is doing is to count the fields while respecting
the number of \ escaped commas.
A 2k+1 number of slashes \ \\\ \\\\\ ... indicating a non separator
A 2k number of slashed \\ \\\\ ... is a valid comma separator
So your code does not give always correct results.

It should be s/\\.//g; instead of s/\\,//g;
Simple error (maybe a typo).
Have a look at the update I post. At my suprise the codes

while ( ... ) { ... } or
until ( ... ) { ... } or
for ( ... ) { ... }

are killing Perl speed. I can not explaing why but the aproach of

do
{
}
while ( ... )

is much faster. Time from 7 minutes decreased to 20 seconds (the same as
your speed). I am not very happy with that, because my C collegue is
loughing because perl cannot "for"

I don't see that.

I have 4 test programs:


#!/usr/bin/perl
# mpouras2 - your original code, with the off-by-one error fixed and
# reformatted by perltidy

my $NC = 1;
open DATA, '<', 'test.txt' or die "$^E\n";

while (<DATA>) {
my ( $i, $n, $Length ) = ( 0, 1, length $_ );

while ( $i < $Length ) {
my $c = substr $_, $i, 1;
if ( $c eq ',' ) {
$n++;
}
elsif ( $c eq '\\' ) {
$i++
}
$i++;
}

$NC = $n if $n > $NC;
}

close DATA;
print "Fields $NC\n"
__END__


#!/usr/bin/perl
# mpouras-do - while {} loop replaced with do {} while loop. No other
# change.

my $NC = 1;
open DATA, '<', 'test.txt' or die "$^E\n";

while (<DATA>) {
my ( $i, $n, $Length ) = ( 0, 1, length $_ );

do {
my $c = substr $_, $i, 1;
if ( $c eq ',' ) {
$n++;
}
elsif ( $c eq '\\' ) {
$i++
}
$i++;
}
while ($i < $Length);

$NC = $n if $n > $NC;
}

close DATA;
print "Fields $NC\n"
__END__


#!/usr/bin/perl
# mpouras-do-ed - Ed's version with do loop and ternary op instead of if

my $NC = 1;
open DATA, '<', 'test.txt' or die "$^E\n";

while (<DATA>) {
my ( $i, $n, $Length ) = ( 0, 1, length $_ );

do {
substr($_,$i, 1) eq ',' ? ( $n++ )
: substr($_,$i, 1) eq '\\' ? ($i++)
: ();
}
while (++$i < $Length);

$NC = $n if $n > $NC;
}

close DATA;
print "Fields $NC\n"
__END__


#!/usr/bin/perl
# mpouras-ben - Ben's version with the error fixed.

my $NC = 1;
open DATA, '<', 'test.txt' or die "$^E\n";

while (<DATA>) {
my ( $i, $n, $Length ) = ( 0, 1, length $_ );

s/\\.//g;
$n = tr/,// + 1;

$NC = $n if $n > $NC;
}

close DATA;
print "Fields $NC\n"
__END__

The run times (user time in seconds on a 1.86 GHz Core2, median of 5
runs, 2 million lines):

mpouras2: 57.76
mpouras-do: 65.90
mpouras-do-ed: 34.61
mpouras-ben: 2.43

So the do loop is actually a bit slower than the while loop. Using a
ternary operator instead if/else is quite a bit faster.

Of course using the built-in s/// and tr// operators is much faster than
processing the string 1 character at a time, no matter how much you
optimize it.


hp
 
K

Kaz Kylheku

Create a test file with 20000000 same lines of 50 commas (it will be
1020000000 bytes)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Now I have a perl and C program running almost the same code. Perl needs
about 6 minutes to finish while the C version finishes at 20 seconds.
The difference is huge. What can I do for a faster perl version
Following the two programs


--== C ==--


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void) {
char *line = NULL;
char *p = NULL;
int NC = 1;
int n = 0;
size_t len = 0;
ssize_t read;

Unused variable.
while (getline(&line, &len, stdin) != -1) {
n = 1;
for (p = line; *p != '\0'; p++) {
if (*p == ',') { n++; }
else if (*p == '\\') { p++; }

The buffer returned by the getline function only includes the newline character
if one occurs in the data. What if it doesn't?

Then by dumb luck could end up with a string like "...," or "...\".

Then you increment the pointer, and now it points to the null terminator.
But then for loop's increment step step will increment the pointer again,
beyond the end of the string. Then the '*p != '\0' test is applied to an
invalid pointer.
}
if (n > NC) {
NC = n;
}
}
if (line) free(line);

You don't have to test for null because free(NULL) is well-defined behavior.

This line is a waste of cycles anyway if you know that the OS cleans up the
virtual memory in one fell swoop. You're just twiddling around some pointers
inside malloc's heap, when a moment later, that entire heap is going to be
history.

Releasing all memory is good in a debug build of a program, just so show that
you /can/ do it, and to help with hunting down leaks.
printf("%d\n", NC);
return EXIT_SUCCESS;
}

There is no need to dynamically allocate buffers to hold an entire
line. Try this one:

#include <stdio.h>

/* Look Ma, no malloc, no getline, no _GNU_SOURCE.
Not even a single pointer declaration. */

int main(void)
{
int NC = 1;
int n = 1;
enum state { init, esc, done } state = init;

while (state != done) {
int ch = getchar();
switch (ch) {
case '\n': case EOF:
if (n > NC)
NC = n;
n = 1;
state = (ch == EOF) ? done : init;
break;
case '\\':
switch (state) {
case init: state = esc; break;
case esc: state = init; break;
}
break;
case ',':
if (state == init)
n++;
/* fallthrough */
default:
state = init;
break;
}
}
printf("%d\n", NC);
return 0;
}
 
K

Kaz Kylheku

^^^
0, otherwise you get an off-by-one error

That is your opinion of what the requirement specification should be. This
programd seems to be determining the number of columns in a comma-separated
file (in which commas can be escaped by backslashes).

The initialization n = 1 asserts that the number of columns in an empty
line, or in a line which contains no (unescaped) commas is 1.

If a line contains characters, but no comma, then it clearly has one
column.

Whether an empty line counts as a column is debatable: it's a matter
that is settled in the requirement specification for the program.

Then you can discuss whether there exists an off-by-one bug.
 
G

George Mpouras

Στις 13/1/2012 12:05 πμ, ο/η Kaz Kylheku έγÏαψε:
Unused variable.


The buffer returned by the getline function only includes the newline character
if one occurs in the data. What if it doesn't?

Then by dumb luck could end up with a string like "...," or "...\".

Then you increment the pointer, and now it points to the null terminator.
But then for loop's increment step step will increment the pointer again,
beyond the end of the string. Then the '*p != '\0' test is applied to an
invalid pointer.


You don't have to test for null because free(NULL) is well-defined behavior.

This line is a waste of cycles anyway if you know that the OS cleans up the
virtual memory in one fell swoop. You're just twiddling around some pointers
inside malloc's heap, when a moment later, that entire heap is going to be
history.

Releasing all memory is good in a debug build of a program, just so show that
you /can/ do it, and to help with hunting down leaks.


There is no need to dynamically allocate buffers to hold an entire
line. Try this one:

#include<stdio.h>

/* Look Ma, no malloc, no getline, no _GNU_SOURCE.
Not even a single pointer declaration. */

int main(void)
{
int NC = 1;
int n = 1;
enum state { init, esc, done } state = init;

while (state != done) {
int ch = getchar();
switch (ch) {
case '\n': case EOF:
if (n> NC)
NC = n;
n = 1;
state = (ch == EOF) ? done : init;
break;
case '\\':
switch (state) {
case init: state = esc; break;
case esc: state = init; break;
}
break;
case ',':
if (state == init)
n++;
/* fallthrough */
default:
state = init;
break;
}
}
printf("%d\n", NC);
return 0;
}

thanks Kaz
 
G

George Mpouras

Hello Peter,

Unfortunately I hade a mistake , at cont loop , I had forget to set $i=0 .
When this corrected the time went againt to 7 minutes.
So the fastest solytion until now is the following. Unfortunately it is C
.....


#!/usr/bin/perl
# apt-get install libinline-perl

use Inline C;

my ($nc, $NC) = (undef, 1);
open DATA, '<', 'data.txt' or die "$^E\n";
while (<DATA>) { $n = nc($_); $n > $NC and $NC = $n }
close DATA;
print "Fields $NC\n"

__END__
__C__
int nc (char* str)
{
int i = 0;
int n = 1;
char c;
while (c = str[i++]) {
if (c == ',') n++;
else if (c == '\\') i++; }
return n;
}
 
G

George Mpouras

Dear Peter,

You are the man ! Your verrsion

s/\\.//g;
$n = tr/,// + 1;

Is as fast as Inline C , thanks .
 
J

Justin C

Hello Peter,

Unfortunately I hade a mistake , at cont loop , I had forget to set $i=0 .
When this corrected the time went againt to 7 minutes.
So the fastest solytion until now is the following. Unfortunately it is C
....

What's unfortunate about that? You use the best/correct tool for the
job; I don't dig my garden with a plough, and I wouldn't plough a field
with a spade.

Justin.
 
P

Peter J. Holzer

this is not the case. The is code is correct for both slow and fast
versions.

Nope. It ignores a , in the first column. So for your test data it
reports 50 columns instead of 51.

hp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,010
Latest member
MerrillEic

Latest Threads

Top