Performance Improvement of complex data structure (hash of hashes of hashes)

Discussion in 'Perl Misc' started by Scott Gilpin, Aug 24, 2004.

  1. Scott Gilpin

    Hi everyone -

    I'm trying to improve the performance (runtime) of a program that
    processes large files. The output of the processing is some fixed
    number of matrices (that can vary between invocations of the program),
    each of which has a different number of rows, and the same number of
    columns. However, the number of rows and columns may not be known
    until the last row of the original file is read. The original file
    contains approximately 100 million rows. Each individual matrix has
    between 5 and 200 rows, and between 50 and 10000 columns. The data
    structure I'm using is a hash of hashes of hashes that stores this
    info. N is the total number of columns, M1 is the total number of
    rows in matrix #1, M2 is the total number of rows in matrix 2, etc,
    etc. The total number of matrices is between 3 and 15.


    matrix #1 => row name 1  => col name 1 => value of 1,1
                                col name 2 => value of 1,2
                                ......
                                col name N => value of 1,N
                 row name 2  => col name 1 => value of 2,1
                                col name 2 => value of 2,2
                                ......
                                col name N => value of 2,N
                 .....
                 row name M1 => col name 1 => value of M1,1
                                col name 2 => value of M1,2
                                ......
                                col name N => value of M1,N

    matrix #2 => row name 1  => col name 1 => value of 1,1
                                col name 2 => value of 1,2
                                ......
                                col name N => value of 1,N
                 .....
                 row name M2 => col name 1 => value of M2,1
                                col name 2 => value of M2,2
                                ......
                                col name N => value of M2,N

    etc, etc...
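
    As a minimal Perl sketch (with made-up matrix, row and column names),
    a single cell of this structure accumulates like this:

        use strict;
        use warnings;

        my %matrix_values;

        ## matrix => row => column => running total
        $matrix_values{'matrix1'}{'rowA'}{'col1'} += 3;
        $matrix_values{'matrix1'}{'rowA'}{'col1'} += 2;

        print $matrix_values{'matrix1'}{'rowA'}{'col1'}, "\n";   ## prints 5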

    Here is the code that I'm using to build up this data structure. I'm
    running perl version 5.8.3 on solaris 8 (sparc processor). The system
    is not memory bound or cpu bound - this program is really the only
    thing that runs. There are several gigabytes of memory, and this
    program doesn't grow bigger than around 100 MB. Right now the run time
    for the following while loop with 100 million rows of data is about 6
    hours. Any small improvements would be great.

    ## loop to process each row of the original data
    while (<INDATA>)
    {
        chomp($_);

        ## Each row is delimited with |
        my @original_row = split(/\|/o, $_);

        ## The cell value and the column name are always in the same position
        my $cell_value = $original_row[24];
        my $col_name   = $original_row[1];

        ## Add this column name to the list of ones we've seen
        $columns_seen{$col_name} = 1;

        ## For each matrix, loop through and increment the row/column value
        foreach my $matrix (@matrixList)
        {
            ## positionHash tells the position of the value for
            ## this matrix in the original data row
            my $row_name = $original_row[ $positionHash{$matrix} ];
            $matrix_values{$matrix}{$row_name}{$col_name} += $cell_value;
        }

    } ## end while

    I tried using DProf & dprofpp, but that didn't reveal anything
    interesting. I also tried setting the initial size of each hash using
    'keys', but this didn't show any improvement. I could only initialize
    the hash of hashes - and not the third level of hashes (since I don't
    know the values in the second hash until they are read in from the
    file). I know that memory allocation in C is expensive, as is
    re-hashing - I suspect that's what's taking up a lot of the time.
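
    (For reference, the pre-sizing idiom in question looks roughly like this -
    a sketch with placeholder bucket counts; only the first two levels can be
    pre-sized, since the row names aren't known up front:)

        my @matrixList = ('matrix1', 'matrix2', 'matrix3');   ## placeholder names

        my %matrix_values;
        keys(%matrix_values) = 16;    ## top level: one slot per matrix

        for my $matrix (@matrixList) {
            ## autovivifies the second-level hash and pre-allocates its buckets
            keys( %{ $matrix_values{$matrix} } ) = 256;
        }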

    My specific questions are:

    Is there a profiler for perl that will produce output with information
    about the underlying C function calls (e.g. malloc)? Or at least
    more information than DProf?
    Is there a more suitable data structure that I should use?
    Is there a way to allocate all the memory I would need at the beginning
    of the program, to eliminate subsequent memory allocation and
    rehashing? (My system has plenty of memory)
    Anything else I'm missing?

    Thanks in advance.
    Scott
    Scott Gilpin, Aug 24, 2004
    #1

  2. Anno Siegel

    Scott Gilpin <> wrote in comp.lang.perl.misc:
    > Hi everyone -
    >
    > I'm trying to improve the performance (runtime) of a program that
    > processes large files. The output of the processing is some fixed
    > number of matrices (that can vary between invocations of the program),
    > each of which has a different number of rows, and the same number of
    > columns. However, the number of rows and columns may not be known
    > until the last row of the original file is read. The original file
    > contains approximately 100 million rows. Each individual matrix has
    > between 5 and 200 rows, and between 50 and 10000 columns. The data
    > structure I'm using is a hash of hashes of hashes that stores this
    > info. N is the total number of columns, M1 is the total number of
    > rows in matrix #1, M2 is the total number of rows in matrix 2, etc,
    > etc. The total number of matrices is between 3 and 15.


    [...]

    > Here is the code that I'm using to build up this data structure. I'm
    > running perl version 5.8.3 on solaris 8 (sparc processor). The system
    > is not memory bound or cpu bound - this program is really the only
    > thing that runs. There are several gigabytes of memory, and this
    > program doesn't grow bigger than around 100 MB. Right now the run time
    > for the following while loop with 100 million rows of data is about 6
    > hours. Any small improvements would be great.


    It shouldn't take that long, unless the data structure blows up way
    beyond 100 MB.

    > ## loop to process each row of the original data
    > while (<INDATA>)
    > {
    >     chomp($_);
    >
    >     ## Each row is delimited with |
    >     my @original_row = split(/\|/o, $_);
    >
    >     ## The cell value and the column name are always in the same position
    >     my $cell_value = $original_row[24];
    >     my $col_name   = $original_row[1];
    >
    >     ## Add this column name to the list of ones we've seen
    >     $columns_seen{$col_name} = 1;


    Where is this used?

    >     ## For each matrix, loop through and increment the row/column value
    >     foreach my $matrix (@matrixList)


    Where is @matrixList set?

    >     {
    >         ## positionHash tells the position of the value for
    >         ## this matrix in the original data row
    >         my $row_name = $original_row[ $positionHash{$matrix} ];


    Where is %positionHash set?

    >         $matrix_values{$matrix}{$row_name}{$col_name} += $cell_value;
    >     }
    >
    > } ## end while


    This code isn't runnable. How are we to improve code we can't run?

    To make it runnable, I had to realize that %positionHash is nowhere
    set and come up with a plausible one. Same for @matrixList. I had
    to find that %columns_seen is nowhere used, and discard it. Then I
    had to generate a set of input data for the program to run with.
    It would have been your job to do that, and you are far better equipped
    to do it.

    That said, I don't see room for fundamental improvement. Apparently
    each "cell value" contributes to all matrices in the same column,
    but in rows that are determined indirectly (through %positionHash).

    Your program does that without doing any excessive extra work. There
    may be re-arrangements of the data structure with corresponding code
    adaptations that make it marginally faster, but I wouldn't expect
    anything better than 10%.
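
    One such re-arrangement, just as a sketch, is to collapse the three
    hash levels into a single hash with a composite key, so the inner
    loop does one hash lookup per cell instead of three (assuming row
    and column names never contain a tab):

        ## replaces the inner loop of the original while(<INDATA>) loop
        foreach my $matrix (@matrixList)
        {
            my $row_name = $original_row[ $positionHash{$matrix} ];
            $matrix_values{"$matrix\t$row_name\t$col_name"} += $cell_value;
        }

        ## later, split the composite keys apart to write the matrices out
        while ( my ($key, $total) = each %matrix_values ) {
            my ($matrix, $row_name, $col_name) = split /\t/, $key;
            ## ... emit $total for that matrix/row/column ...
        }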

    > I tried using DProf & dprofpp, but that didn't reveal anything
    > interesting.


    It can't. DProf works on a subroutine basis, but your code doesn't
    use any subroutines.

    > I also tried setting the initial size of each hash using
    > 'keys', but this didn't show any improvement. I could only initialize
    > the hash of hashes - and not the third level of hashes (since I don't
    > know the values in the second hash until they are read in from the
    > file). I know that memory allocation in C is expensive, as is
    > re-hashing - I suspect that's what's taking up a lot of the time.


    One thing to observe is whether program speed deteriorates over
    time. Just print something to stderr every so-many records and
    watch the rhythm. If it gets slower with time, the problem is most
    likely memory related. If it doesn't, you're none the wiser.
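
    Something like this, say (the one-million interval is arbitrary):

        use Time::HiRes qw(time);

        my $count = 0;
        my $last  = time();

        while (<INDATA>) {
            ## ... existing processing ...

            unless ( ++$count % 1_000_000 ) {
                my $now = time();
                printf STDERR "%d records, last million took %.1f s\n",
                    $count, $now - $last;
                $last = $now;
            }
        }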

    Anno
    Anno Siegel, Aug 25, 2004
    #2

  3. Xho (Guest)

    "Scott Gilpin" <> wrote:

    > Here is the code that I'm using to build up this data structure. I'm
    > running perl version 5.8.3 on solaris 8 (sparc processor). The system
    > is not memory bound or cpu bound -


    If it is not memory or cpu bound, then it must be I/O bound. That is
    pretty much the only other choice, right? Are you sure it isn't CPU bound?
    The fact that it is the only thing that runs on that machine doesn't mean
    it isn't CPU bound.
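
    A cheap way to check is to compare CPU time against wall-clock time
    at the end of the run - a sketch using Perl's built-in times():

        my $wall_start = time();

        ## ... the processing loop ...

        my ($user, $system) = times();
        my $wall = time() - $wall_start;
        printf STDERR "wall %.0f s, cpu %.0f s (user %.0f + sys %.0f)\n",
            $wall, $user + $system, $user, $system;
        ## if CPU time is close to wall time, the loop is CPU bound, not I/O bound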


    > ## loop to process each row of the original data
    > while (<INDATA>)
    > {
    >     chomp($_);
    >
    >     ## Each row is delimited with |
    >     my @original_row = split(/\|/o, $_);

    } # Let us say you end the loop right there.
      # How long does it take to run now?
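
    A sketch of timing that truncated loop with Time::HiRes:

        use Time::HiRes qw(time);

        my $start = time();

        while (<INDATA>) {
            chomp($_);
            my @original_row = split(/\|/, $_);
            ## loop body ends here - no hash updates at all
        }

        printf STDERR "split-only pass took %.1f s\n", time() - $start;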

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Aug 26, 2004
    #3
