Performance improvement of a complex data structure (hash of hashes of hashes)

Scott Gilpin

Hi everyone -

I'm trying to improve the performance (runtime) of a program that
processes large files. The output of the processing is a fixed
number of matrices (the number can vary between invocations of the
program), each of which has a different number of rows and the same
number of columns. However, the number of rows and columns may not
be known until the last row of the original file is read. The
original file contains approximately 100 million rows. Each
individual matrix has between 5 and 200 rows, and between 50 and
10000 columns. The data structure I'm using is a hash of hashes of
hashes that stores this info. N is the total number of columns, M1
is the total number of rows in matrix #1, M2 is the total number of
rows in matrix #2, and so on. The total number of matrices is
between 3 and 15.


matrix #1 => row name 1 => col name 1 => value of 1,1
col name 2 => value of 1,2
......
col name N => value of 1,N
row name 2 => col name 1 => value of 2,1
col name 2 => value of 2,2
......
col name N => value of 2,N
.....
row name M1=> col name 1 => value of M1,1
col name 2 => value of M1,2
......
col name N => value of M1,N

matrix #2 => row name 1 => col name 1 => value of 2,1
col name 2 => value of 2,1
......
col name N => value of 2,1
.....
row name M2=> col name 1 => value of M2,1
col name 2 => value of M2,N
......
col name N => value of M2,N

etc, etc...
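In Perl terms, a cell is accessed as
$matrix_values{$matrix}{$row_name}{$col_name}. A minimal sketch of
the structure (with made-up names), to make the nesting concrete:

    ## %matrix_values: matrix name => row name => column name => value
    my %matrix_values;
    $matrix_values{'matrix1'}{'rowA'}{'col1'} += 3.5;  # accumulate into cell (rowA, col1)
    $matrix_values{'matrix1'}{'rowA'}{'col2'} += 1.0;

    ## Reading one cell back out:
    my $cell = $matrix_values{'matrix1'}{'rowA'}{'col1'};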

Here is the code that I'm using to build up this data structure. I'm
running Perl 5.8.3 on Solaris 8 (SPARC). The system is not memory
bound or CPU bound - this program is really the only thing that runs
on it. There are several gigabytes of memory, and this program
doesn't grow beyond around 100 MB. Right now the run time for the
following while loop over 100 million rows of data is about 6 hours.
Any small improvement would be great.

## loop to process each row of the original data
while (<INDATA>)
{
    chomp($_);

    ## Each row is delimited with |
    my @original_row = split(/\|/o, $_);

    ## The cell value and the column name are always in the same position
    my $cell_value = $original_row[24];
    my $col_name   = $original_row[1];

    ## Add this column name to the list of ones we've seen
    $columns_seen{$col_name} = 1;

    ## For each matrix, loop through and increment the row/column value
    foreach my $matrix (@matrixList)
    {
        ## %positionHash tells the position of the row name for
        ## this matrix in the original data row
        my $row_name = $original_row[$positionHash{$matrix}];
        $matrix_values{$matrix}{$row_name}{$col_name} += $cell_value;
    }

} ## end while

I tried using DProf & dprofpp, but that didn't reveal anything
interesting. I also tried presetting the size of each hash by
assigning to 'keys', but this didn't show any improvement. I could
only preset the top levels of the structure - not the third level of
hashes, since I don't know the keys of the second level until they
are read in from the file. I know that memory allocation in C is
expensive, as is re-hashing - I suspect that's what's taking up a lot
of the time.
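
Roughly, the presizing I tried looked like this (a sketch; it assumes
@matrixList is known up front and uses the 200-row upper bound from
above):

    ## Preset hash bucket counts by assigning to keys() as an lvalue.
    ## Perl rounds the count up to the next power of two.
    keys(%matrix_values) = scalar @matrixList;  # top level: one entry per matrix
    for my $matrix (@matrixList) {
        ## second level: at most 200 rows per matrix
        keys(%{ $matrix_values{$matrix} }) = 200;
    }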

My specific questions are:

1. Is there a profiler for Perl that will produce output with
   information about the underlying C function calls (e.g. malloc),
   or at least more information than DProf?
2. Is there a more suitable data structure that I should use?
3. Is there a way to allocate all the memory I would need at the
   beginning of the program, to eliminate subsequent memory
   allocation and rehashing? (My system has plenty of memory.)
4. Anything else I'm missing?

Thanks in advance.
Scott
 

Anno Siegel

Scott Gilpin said:
Hi everyone -

I'm trying to improve the performance (runtime) of a program that
processes large files. The output of the processing is a fixed
number of matrices (the number can vary between invocations of the
program), each of which has a different number of rows and the same
number of columns. However, the number of rows and columns may not
be known until the last row of the original file is read. The
original file contains approximately 100 million rows. Each
individual matrix has between 5 and 200 rows, and between 50 and
10000 columns. The data structure I'm using is a hash of hashes of
hashes that stores this info. N is the total number of columns, M1
is the total number of rows in matrix #1, M2 is the total number of
rows in matrix #2, and so on. The total number of matrices is
between 3 and 15.
[...]

Here is the code that I'm using to build up this data structure. I'm
running Perl 5.8.3 on Solaris 8 (SPARC). The system is not memory
bound or CPU bound - this program is really the only thing that runs
on it. There are several gigabytes of memory, and this program
doesn't grow beyond around 100 MB. Right now the run time for the
following while loop over 100 million rows of data is about 6 hours.
Any small improvement would be great.

It shouldn't take that long, unless the data structure blows up way
beyond 100 MB.
## loop to process each row of the original data
while (<INDATA>)
{
    chomp($_);

    ## Each row is delimited with |
    my @original_row = split(/\|/o, $_);

    ## The cell value and the column name are always in the same position
    my $cell_value = $original_row[24];
    my $col_name   = $original_row[1];

    ## Add this column name to the list of ones we've seen
    $columns_seen{$col_name} = 1;


Where is this used?
    ## For each matrix, loop through and increment the row/column value
    foreach my $matrix (@matrixList)

Where is @matrixList set?
    {
        ## %positionHash tells the position of the row name for
        ## this matrix in the original data row
        my $row_name = $original_row[$positionHash{$matrix}];

Where is %positionHash set?
        $matrix_values{$matrix}{$row_name}{$col_name} += $cell_value;
    }

} ## end while

This code isn't runnable. How are we to improve code we can't run?

To make it runnable, I had to realize that %positionHash is nowhere
set and come up with a plausible one. Same for @matrixList. I had
to find that %columns_seen is nowhere used, and discard it. Then I
had to generate a set of input data for the program to run with.
It would have been your job to do that, and you are far better
equipped to do it.

That said, I don't see room for fundamental improvement. Apparently
each "cell value" contributes to all matrices in the same column,
but in rows that are determined indirectly (through %positionHash).

Your program does that without doing any excessive extra work. There
may be re-arrangements of the data structure with corresponding code
adaptations that make it marginally faster, but I wouldn't expect
anything better than 10%.
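
One such rearrangement, for what it's worth (an untested sketch):
collapse the three hash levels into a single flat hash keyed on a
joined string. The separator "\x1F" is an assumption - it must not
occur in any matrix, row, or column name.

    ## One flat hash instead of a hash of hashes of hashes.
    my %flat;
    while (<INDATA>) {
        chomp;
        my @original_row = split /\|/;
        my $cell_value   = $original_row[24];
        my $col_name     = $original_row[1];
        for my $matrix (@matrixList) {
            my $row_name = $original_row[$positionHash{$matrix}];
            ## one hash lookup per cell instead of three
            $flat{ join "\x1F", $matrix, $row_name, $col_name } += $cell_value;
        }
    }
    ## Split the keys back apart (split /\x1F/) when writing the matrices out.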
I tried using DProf & dprofpp, but that didn't reveal anything
interesting.

It can't. DProf works on a per-subroutine basis, but your code
doesn't use any subroutines.
I also tried presetting the size of each hash by assigning to
'keys', but this didn't show any improvement. I could only preset
the top levels of the structure - not the third level of hashes,
since I don't know the keys of the second level until they are read
in from the file. I know that memory allocation in C is expensive,
as is re-hashing - I suspect that's what's taking up a lot of the
time.

One thing to observe is whether program speed deteriorates over
time. Just print something to stderr every so-many records and
watch the rhythm. If it gets slower with time, the problem is most
likely memory related. If it doesn't, you're none the wiser.
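
Something like this would do (a sketch):

    my $count = 0;
    while (<INDATA>) {
        ## ... the existing processing goes here ...

        ## Heartbeat: one line per million records, with elapsed seconds.
        ## $^T is the time the program started.
        print STDERR "rows: $count  elapsed: ", time() - $^T, "s\n"
            unless ++$count % 1_000_000;
    }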

Anno
 

ctcgag

Scott Gilpin said:
Here is the code that I'm using to build up this data structure. I'm
running Perl 5.8.3 on Solaris 8 (SPARC). The system is not memory
bound or CPU bound -

If it is not memory or CPU bound, then it must be I/O bound. That is
pretty much the only other choice, right? Are you sure it isn't CPU
bound? The fact that it is the only thing that runs on that machine
doesn't mean it isn't CPU bound.

## loop to process each row of the original data
while (<INDATA>)
{
    chomp($_);

    ## Each row is delimited with |
    my @original_row = split(/\|/o, $_);

} ## Let us say you end the loop right there.
  ## How long does it take to run now?
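
Wrapping that in a rough timer (a sketch) tells you how much of the
six hours is just reading and splitting:

    ## Time the stripped-down loop: read and split only.
    my $start = time();
    while (<INDATA>) {
        chomp($_);
        my @original_row = split(/\|/o, $_);
    }
    print STDERR "read+split only: ", time() - $start, " seconds\n";
    ## If this alone accounts for most of the runtime, the hash
    ## updates aren't the bottleneck.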

Xho
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

Custom matrix multiplication produces different results to glm 0
Sorting hash of hashes 3
Sorting hash of hashes 9
hash of hashes 9
Hash of Hashes 5
ordered hashes 13
hash of arrays 1
hashes 2

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top