I had assumed you were using Text::CSV_XS to parse lines rather than
printing them. You might want to try printing them in Perl, you never
know, it might be faster.
That's another data set, where the input and output data is in CSV. I
actually have several data sets that I convert into a single,
normalized, CSV which is then loaded into the DB with Text::CSV_XS.
Essentially, for most of the other data sets, I have set up translation
maps that convert the various input fields into the correct output
fields. Upon load, each of the fields either gets inserted into the DB
directly or translated into an integer key by the Perl before being
inserted.
For the other transforms, I have modularized and factored out much of
the translation code, but I think that that is part of the slowdown
... too many function calls. It's a bit too unwieldy to post here
(especially when I hand-copy it), but this thread has given me a bunch
of ideas, especially on the handling/conversions of arrays/hashes.
use strict; use warnings;
use IO::File; use Text::CSV_XS;

my @valid_columns = qw/ keya keyb keyc keyd keye keyf keyg keybig
    keyanother /;
my %valid_columns = map { $_ => 1 } @valid_columns;
my $output_csv = Text::CSV_XS->new({ eol => "\n", binary => 1 });
$output_csv->print(*STDOUT, process_line($_)) while (<>);

sub process_line {
    my ($line) = @_;
    my ($ip_range, $rest) = split /\s+/, $line, 2;
    chomp($rest);
    my %ip_details = (ip_range => $ip_range);
    # split on ':', then split each element on '=' and stick in hash
    map { my ($k, $v) = split /=/; $ip_details{$k} = $v } split(/:/, $rest);
    # fix up column with random bad bytes (note =~ binding, not =)
    $ip_details{keya} =~ s/[^\x20-\x7e]//g;
    my @cols = map { $ip_details{$_} } @valid_columns;
    return \@cols;
}

I made a few changes that sped it up, but not by much:
my ($ip_range, $rest) = split /\s+/, $line, 2;
chomp($rest);
## This will give the same answer as your map approach for "well-formed"
## data. For malformed data, it will give a different, but probably
## equally meaningless, answer
my %ip_details = (split /[=:]/, $rest);
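For anyone following along, here is a small sketch (with made-up field
values, not the real data) of why the one-step split gives the same hash
on well-formed input: a single split on both delimiters yields an
alternating key/value list, and assigning a list to a hash pairs it up.

```perl
use strict; use warnings;

# Made-up sample, just to show the two splits agree:
my $rest = "keya=foo:keyb=bar:keyc=baz";

# Two-step: split on ':', then split each pair on '='
my %two_step;
for my $pair (split /:/, $rest) {
    my ($k, $v) = split /=/, $pair;
    $two_step{$k} = $v;
}

# One-step: alternating key/value list, paired up by the hash assignment
my %one_step = (split /[=:]/, $rest);
```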
Thanks ... very simple and elegant. I sometimes tend to complicate
things needlessly.
# I don't see what the point of this is, as you never use the ip_range key.
$ip_details{ip_range} = $ip_range;
Sorry, guess I missed putting that at the beginning of the
@valid_columns
## This takes a surprising amount of time, but I don't know what you can
## do about it.
# fix up column with random bad bytes
$ip_details{keya} =~ s/[^\x20-\x7e]//g;
Is this from experience or profiling? Is there an easy way to profile
single lines like this, without factoring them out into a subroutine?
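One way to time a single statement without refactoring the real code is
the core Benchmark module; for per-line numbers on a real run,
Devel::NYTProf can produce a line-level profile. A sketch (sample data
made up), which also tries tr///, usually faster than s///g for plain
character-class deletion:

```perl
use strict; use warnings;
use Benchmark qw(cmpthese);

# made-up string with embedded non-printable bytes
my $dirty = "value\x01with\x02random\x03bad\x04bytes" x 20;

cmpthese(50_000, {
    # the original substitution
    subst => sub { my $s = $dirty; $s =~ s/[^\x20-\x7e]//g; },
    # tr/// with 'c' (complement) and 'd' (delete) does the same job
    trans => sub { my $s = $dirty; $s =~ tr/\x20-\x7e//cd; },
});
```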
## Use a hash slice rather than map:
return [@ip_details{@valid_columns}];
OK, this is a big question of mine, and what (I think) is the major
slowdown in some of my other code. My data flow usually looks like
this:
CSV -> arrayref -> hashref -> new hashref with transformed values and
names -> arrayref -> CSV
I did it this way so that I could factor out a lot of the common code
associated with transforming and outputting a new input format. In
order to do this, I usually pass around a line of input in various
forms. For example, one sub will read the CSV into an arrayref, pass
it to another that will convert it to a hashref, which will then pass
it to another to transform the hashref's values, etc. I realize that
this is not necessarily the fastest way of doing these things, but it
helps a lot when I have 10 different input types being translated into
the same output type.
Because I have a few subs being called *many* times, I have a couple
local optimization questions:
-What is the most efficient way of calling a module subroutine and/or
object's method? I'm assuming that, like C, it's cheaper to pass a
reference to a hash/array than to pass the actual hash/array, right?
Does this also hold true for the return value from a sub?
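Yes, broadly: a sub call flattens its arguments into @_, so passing
%hash expands it into a long key/value list, and the usual my %h = @_;
then rebuilds a whole new hash, while passing \%hash moves one scalar.
Return values flatten the same way, so returning a reference avoids the
copy too. A sketch with made-up data:

```perl
use strict; use warnings;

my %big = map { ("key$_" => $_) } 1 .. 1000;

sub by_copy {
    my %h = @_;        # rebuilds the whole hash from the flattened list
    return $h{key500};
}

sub by_ref {
    my ($h) = @_;      # one scalar; no per-element work
    return $h->{key500};
}

my $x = by_copy(%big);   # 2000-element argument list
my $y = by_ref(\%big);   # 1-element argument list
```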
-What is the most efficient way of converting back and forth between a
hash and an array, when the key->index mapping is known? Does the
answer change at all when dealing with references?
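Hash slices handle both directions when the key order is known, and they
work the same through references with the @{...} dereference syntax. A
sketch with made-up columns:

```perl
use strict; use warnings;

my @columns = qw/keya keyb keyc/;

# hash -> array: a slice pulls values out in column order
my %row  = (keya => 1, keyb => 2, keyc => 3);
my @vals = @row{@columns};

# array -> hash: a slice assignment pairs keys with values by position
my %back;
@back{@columns} = @vals;

# the same slice through a reference:
my $row_ref = \%row;
my @vals2   = @{$row_ref}{@columns};
```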
-As above, when returning a value, does it make a difference if you
create a new local variable to return, or just return the computation
directly? I.e. (a very simple example):
my $a = {a=>'1', b=>'2'}; return $a;
# vs
return {a=>'1', b=>'2'};
I'd like to think that perl's compiler might be able to figure out that
these are equivalent, but perhaps I am wrong.
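The anonymous-hash constructor does the same work either way; the
lexical version only adds one extra pad assignment, which should be
nearly free. Rather than guess, a quick cmpthese run (sketch below) will
show how close the two really are on your perl:

```perl
use strict; use warnings;
use Benchmark qw(cmpthese);

sub via_lexical { my $h = { a => '1', b => '2' }; return $h }
sub direct      { return { a => '1', b => '2' } }

cmpthese(200_000, {
    via_lexical => \&via_lexical,
    direct      => \&direct,
});
```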
-Sometimes I assign one of the hash/array elements to a local value so
that I can transform it, and eventually assign it back to the hash. Is
this a "win" vs just transforming the hash value directly? I.e.:
sub transform {
    my ($hashref) = @_;
    my $name = $hashref->{name_key};
    $name =~ s/A/a/g;
    # ... more $name transforms ...
    $hashref->{name_key} = $name;
    return $hashref;
}
# ... vs ...
sub transform {
    my ($hashref) = @_;
    $hashref->{name_key} =~ s/A/a/g;
    # ... more $hashref->{name_key} transforms ...
    return $hashref;
}
I'm sure that the answer is "it depends", but my next question would be
"On what?". My naive thoughts would be that it would depend on the
number of hash lookups that are needed, which relies on the (C)
assumption that perl would not be able to cache the hashref lookup into
a "register". Are there any other costs to the hash lookup/update
implementation?
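Your C intuition is right: perl does not hoist repeated
$hashref->{name_key} lookups into a register, so each mention is a fresh
hash lookup, and the lexical copy trades those lookups for one copy in
and one store out. A third option, sketched here, is to alias the
element with a one-element foreach, which gives a direct handle on the
value with no copying at all:

```perl
use strict; use warnings;

my $hashref = { name_key => 'AlphaBravo' };

# foreach aliases its loop variable to the element itself,
# so $v reads and writes $hashref->{name_key} directly
for my $v ($hashref->{name_key}) {
    $v =~ s/A/a/g;
    $v =~ s/B/b/g;   # more transforms: no repeated lookups, no copy-back
}
```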
-When transforming values, is it more efficient to use the same
variable to hold the new value or to create a new variable? I'm
thinking that this is one of those space vs. time questions, but since
I have a lot of memory, I'd like to optimize for time. I.e.:
sub transform {
    my ($manufacturer) = @_;
    my %translation_of = ( 'Mercedes' => 'Luxury', 'BMW' => 'Luxury',
        'Honda' => 'Normal' );
    $manufacturer = $translation_of{$manufacturer};
    # do more stuff
}
# ... vs ...
sub transform {
    my ($manufacturer) = @_;
    my %translation_of = ( 'Mercedes' => 'Luxury', 'BMW' => 'Luxury',
        'Honda' => 'Normal' );
    my $transformed_manufacturer = $translation_of{$manufacturer};
    # do more stuff
}
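Before worrying about reuse-vs-new-variable, note that both versions
rebuild %translation_of on every single call. Hoisting the constant map
out of the sub, as sketched below, should dwarf whichever variable
choice you make:

```perl
use strict; use warnings;

# built once at startup, not on every call
my %translation_of = (
    'Mercedes' => 'Luxury',
    'BMW'      => 'Luxury',
    'Honda'    => 'Normal',
);

sub transform {
    my ($manufacturer) = @_;
    return $translation_of{$manufacturer};
}
```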
I get about 44,000 lines per second. Anyway, I don't see any obvious
inefficiencies. Maybe parallelization is the better route after all.
Thanks, I'll see what I can do with that, and the previous examples.