Scott Gilpin
Hi everyone -
I'm trying to improve the performance (runtime) of a program that
processes large files. The output of the processing is a set of
matrices; the number of matrices is fixed for a given run but can vary
between invocations of the program. Each matrix has a different number
of rows and the same number of columns. However, the number of rows and
columns may not be known until the last row of the original file is
read. The original file contains approximately 100 million rows. Each
individual matrix has between 5 and 200 rows, and between 50 and 10,000
columns. The data structure I'm using to store this information is a
hash of hashes of hashes. N is the total number of columns, M1 is the
total number of rows in matrix #1, M2 is the total number of rows in
matrix #2, and so on. The total number of matrices is between 3 and 15.
matrix #1 => row name 1  => col name 1 => value of 1,1
                            col name 2 => value of 1,2
                            ......
                            col name N => value of 1,N
             row name 2  => col name 1 => value of 2,1
                            col name 2 => value of 2,2
                            ......
                            col name N => value of 2,N
             .....
             row name M1 => col name 1 => value of M1,1
                            col name 2 => value of M1,2
                            ......
                            col name N => value of M1,N
matrix #2 => row name 1  => col name 1 => value of 1,1
                            col name 2 => value of 1,2
                            ......
                            col name N => value of 1,N
             .....
             row name M2 => col name 1 => value of M2,1
                            col name 2 => value of M2,2
                            ......
                            col name N => value of M2,N
etc, etc...
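In Perl terms, the structure looks roughly like this (the matrix, row,
and column names and the values here are just illustrative
placeholders):

    %matrix_values = (
        'matrix1' => {
            'rowA' => { 'col1' => 3.5, 'col2' => 0.25 },
            'rowB' => { 'col1' => 1.0, 'col2' => 7.75 },
        },
        'matrix2' => {
            'rowX' => { 'col1' => 2.0, 'col2' => 4.5  },
        },
    );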
Here is the code that I'm using to build up this data structure. I'm
running Perl 5.8.3 on Solaris 8 (SPARC processor). The system is not
memory bound or CPU bound - this program is really the only thing that
runs. There are several gigabytes of memory, and this program doesn't
grow bigger than around 100 MB. Right now the run time for the
following while loop over 100 million rows of data is about 6 hours.
Any small improvement would be great.
## Loop to process each row of the original data
while (<INDATA>)
{
    chomp($_);

    ## Each row is delimited with |
    my @original_row = split(/\|/o, $_);

    ## The cell value and the column name are always in the same position
    my $cell_value = $original_row[24];
    my $col_name   = $original_row[1];

    ## Add this column name to the list of ones we've seen
    $columns_seen{$col_name} = 1;

    ## For each matrix, loop through and increment the row/column value
    foreach my $matrix (@matrixList)
    {
        ## %positionHash tells the position of the row name for
        ## this matrix in the original data row
        my $row_name = $original_row[$positionHash{$matrix}];
        $matrix_values{$matrix}{$row_name}{$col_name} += $cell_value;
    }
} ## end while
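For reference, @matrixList and %positionHash are set up earlier in the
program along these lines (the matrix names and field positions below
are made up for illustration):

    ## Each matrix has a name, and %positionHash maps that name to the
    ## field index in the split row that holds that matrix's row name.
    my @matrixList   = ('matrix1', 'matrix2', 'matrix3');
    my %positionHash = (
        'matrix1' => 3,
        'matrix2' => 7,
        'matrix3' => 12,
    );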
I tried using DProf and dprofpp, but that didn't reveal anything
interesting. I also tried presetting the size of each hash by assigning
to 'keys', but this didn't show any improvement. I could only presize
the upper levels of the structure - not the third level of hashes -
since I don't know what keys will appear in the second-level hashes
until they are read in from the file. I know that memory allocation in
C is expensive, as is re-hashing - I suspect that's where a lot of the
time is going.
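For what it's worth, the presizing attempt looked something like this
(a sketch; the bucket counts are made up):

    ## Presize the bucket counts before filling the hashes; Perl rounds
    ## the number up to the next power of two. Only levels whose keys
    ## are known in advance can be presized this way.
    keys(%matrix_values) = 16;                             ## top level: one entry per matrix
    keys(%{ $matrix_values{$_} }) = 256 for @matrixList;   ## second level: row-name hashes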
My specific questions are:
1. Is there a profiler for Perl that will produce output with
   information about the underlying C function calls (e.g. malloc), or
   at least more information than DProf?
2. Is there a more suitable data structure that I should use?
3. Is there a way to allocate all the memory I would need at the
   beginning of the program, to eliminate subsequent memory allocation
   and rehashing? (My system has plenty of memory.)
4. Anything else I'm missing?
Thanks in advance.
Scott