Split line into an array vs multiple strings

scottmf

Can anyone explain why, when I read in a file and save the data to a
2-d array, it is faster to split each line into an array rather than
into a group of strings (separate scalar variables)? Also, why does each
subsequent line take longer to process with the strings, whereas with
the array every line takes the same amount of time?

Thanks,
Scott

#!/usr/local/bin/perl
#
use Benchmark;
use strict;

# Create the sample file (sample.txt): 20000 lines of 7 space-separated fields
open(SAMPLE, '>', 'sample.txt') or die "Cannot open 'sample.txt': $!";
for (my $i = 0; $i < 20000; $i++) {
    my $line = "abc"." "."def"." ".rand()." ".rand()." ".rand()." ".rand()." ".rand()."\n";
    print SAMPLE $line;
}
close(SAMPLE);

# Count how long it takes to run each version
my $count = 10;
timethese $count, {
    'string_test' => \&string_test,
    'array_test'  => \&array_test
};

# Split each line into seven separate scalars, then copy them into the 2-d array.
sub string_test {
    my @array;
    my $i = 0;
    open(SAMPLE, '<', 'sample.txt') or die "Cannot open 'sample.txt': $!";
    while (my $line = <SAMPLE>) {
        chomp($line);
        my ($el1, $el2, $el3, $el4, $el5, $el6, $el7) = split /\s+/, $line;
        $array[$i][0] = $el1;
        $array[$i][1] = $el2;
        $array[$i][2] = $el3;
        $array[$i][3] = $el4;
        $array[$i][4] = $el5;
        $array[$i][5] = $el6;
        $array[$i][6] = $el7;
        $i++;
    }
    close(SAMPLE);
}

# Split each line into a temporary array, then copy its elements into the 2-d array.
sub array_test {
    my @array;
    my $i = 0;
    open(SAMPLE, '<', 'sample.txt') or die "Cannot open 'sample.txt': $!";
    while (my $line = <SAMPLE>) {
        chomp($line);
        my @line_data = split /\s+/, $line;
        $array[$i][0] = $line_data[0];
        $array[$i][1] = $line_data[1];
        $array[$i][2] = $line_data[2];
        $array[$i][3] = $line_data[3];
        $array[$i][4] = $line_data[4];
        $array[$i][5] = $line_data[5];
        $array[$i][6] = $line_data[6];
        $i++;
    }
    close(SAMPLE);
}


returns:

Benchmark: timing 10 iterations of array_test, string_test...
array_test: 4 wallclock secs ( 4.30 usr + 0.00 sys = 4.30 CPU) @ 2.33/s (n=10)
string_test: 18 wallclock secs (18.00 usr + 0.00 sys = 18.00 CPU) @ 0.56/s (n=10)
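
One way to check the claim that each subsequent line takes longer is to time fixed-size chunks of lines rather than the whole run. The following is only a minimal sketch: the helper name time_chunks, the chunk size of 1000, and the in-memory sample data are assumptions made for illustration, not part of the original benchmark.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run $process on every line read from $fh and
# return the elapsed wall-clock seconds for each chunk of $chunk lines.
sub time_chunks {
    my ($fh, $chunk, $process) = @_;
    my @elapsed;
    my $start = time;
    my $n     = 0;
    while (my $line = <$fh>) {
        $process->($line);
        if (++$n % $chunk == 0) {
            my $now = time;
            push @elapsed, $now - $start;
            $start = $now;
        }
    }
    return @elapsed;
}

# In-memory data (an assumption) so the sketch runs without sample.txt.
my $data = join '', map { "abc def $_ 0.1 0.2 0.3 0.4\n" } 1 .. 5000;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;
my @times = time_chunks($fh, 1000, sub {
    my ($line) = @_;
    chomp $line;
    my ($el1, $el2, $el3, $el4, $el5, $el6, $el7) = split /\s+/, $line;
    push @rows, [ $el1, $el2, $el3, $el4, $el5, $el6, $el7 ];
});
close $fh;

printf "chunk %d: %.6f s\n", $_ + 1, $times[$_] for 0 .. $#times;
```

If the per-chunk times stay roughly flat, the per-line cost is constant; a steady increase from chunk to chunk would support the observation above.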
 
John W. Krahn

scottmf said:
Can anyone explain why when I am reading in a file and saving the data
to a 2-d array it is faster if I split each line into an array rather
than a group of strings?

I can't explain it because on my computer the "string" version runs faster.

The usual way to do something like that in Perl is:

sub some_test {
    my @array;
    open SAMPLE, '<', 'sample.txt' or die "Cannot open 'sample.txt': $!";
    while ( <SAMPLE> ) {
        push @array, [ split ];
    }
    close SAMPLE;
}

Which is a bit faster than your two examples.

And if you need to limit it to only the first seven fields:

sub some_test {
    my @array;
    open SAMPLE, '<', 'sample.txt' or die "Cannot open 'sample.txt': $!";
    while ( <SAMPLE> ) {
        push @array, [ ( split )[ 0 .. 6 ] ];
    }
    close SAMPLE;
}
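
For anyone who wants to try the same idea without creating sample.txt, here is a self-contained sketch. The helper name read_rows, the lexical filehandle, and the in-memory sample text are assumptions for illustration. Unlike the slice version above, it truncates long lines to the field limit but does not pad short lines with undef.

```perl
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated fields from a filehandle
# into an array of array references, keeping at most $max fields per line.
sub read_rows {
    my ($fh, $max) = @_;
    my @rows;
    while (my $line = <$fh>) {
        my @fields = split ' ', $line;    # same splitting rule as a bare split
        $#fields = $max - 1 if @fields > $max;
        push @rows, \@fields;             # @fields is a fresh array each iteration
    }
    return \@rows;
}

# In-memory sample text (an assumption) standing in for sample.txt.
my $text = "abc def 0.1 0.2 0.3 0.4 0.5 extra\n"
         . "ghi jkl 0.6 0.7 0.8 0.9 1.0\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $rows = read_rows($fh, 7);
close $fh;

print scalar(@$rows), " rows, ", scalar(@{ $rows->[0] }), " fields in the first row\n";
print "$rows->[1][0] $rows->[1][6]\n";
```

Splitting on ' ' also discards the trailing newline, so no chomp is needed before the split.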





John
 
scottmf

I ran some more tests, starting with an input file of 10000 lines and
increasing the file size by 10000 lines for each benchmark, and I get
the following.
At this rate, if my input file had 80000 lines, the string method would
take almost 30 times longer than the array method just to grab the
data. Also, does anyone know why, in the benchmark comparison, the first
column changes from iterations per second to seconds per iteration?

Benchmark: timing 10 iterations of array_test, string_test...
array_test: 2 wallclock secs ( 2.09 usr + 0.02 sys = 2.11 CPU) @ 4.74/s (n=10)
string_test: 6 wallclock secs ( 5.17 usr + 0.01 sys = 5.19 CPU) @ 1.93/s (n=10)
              Rate string_test  array_test
string_test 1.93/s          --        -59%
array_test  4.74/s        146%          --

Benchmark: timing 10 iterations of array_test, string_test...
array_test: 4 wallclock secs ( 4.20 usr + 0.03 sys = 4.23 CPU) @ 2.36/s (n=10)
string_test: 17 wallclock secs (16.52 usr + 0.02 sys = 16.53 CPU) @ 0.60/s (n=10)
            s/iter string_test  array_test
string_test   1.65          --        -74%
array_test   0.423        290%          --

Benchmark: timing 10 iterations of array_test, string_test...
array_test: 6 wallclock secs ( 6.31 usr + 0.02 sys = 6.33 CPU) @ 1.58/s (n=10)
string_test: 39 wallclock secs (39.33 usr + 0.11 sys = 39.44 CPU) @ 0.25/s (n=10)
            s/iter string_test  array_test
string_test   3.94          --        -84%
array_test   0.633        523%          --

Benchmark: timing 10 iterations of array_test, string_test...
array_test: 8 wallclock secs ( 8.39 usr + 0.03 sys = 8.42 CPU) @ 1.19/s (n=10)
string_test: 84 wallclock secs (83.25 usr + 0.05 sys = 83.30 CPU) @ 0.12/s (n=10)
            s/iter string_test  array_test
string_test   8.33          --        -90%
array_test   0.842        889%          --
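
On the column label: as far as I can tell from reading Benchmark.pm, cmpthese labels the first data column "Rate" when the slowest entry manages at least one iteration per second and "s/iter" otherwise, which would match the runs above (1.93/s in the first, 0.60/s and lower in the rest). Treat that as an assumption and check your installed version. The sketch below shows how to inspect the header row that cmpthese returns; the two workloads are hypothetical placeholders, not the string_test and array_test subs.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Two placeholder workloads (hypothetical stand-ins for string_test and array_test).
my %tests = (
    join_test   => sub { my $s = join ',', 1 .. 2000; return length $s },
    concat_test => sub { my $s = ''; $s .= "$_," for 1 .. 2000; return length $s },
);

# The 'none' style suppresses the printed report; both calls still return their data.
my $results = timethese(200, \%tests, 'none');
my $rows    = cmpthese($results, 'none');

# $rows->[0] is the header row; its second cell is the column label.
print "first column is labelled: $rows->[0][1]\n";
print join("\t", @$_), "\n" for @$rows;
```

Benchmark may still print a "too few iterations" warning for workloads this small; that does not affect the returned rows.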
 
