Perl hash of hash efficiency.

Discussion in 'Perl Misc' started by tak, Aug 2, 2006.

  1. tak

    tak Guest

    Hi,

    I have a script that loads a txt file with 240k lines in it to a hash
    currently. And when it loads the data to the hash - it becomes slower
    and slower when it reaches maybe around 150k (probably due to
    collisions, since Perl's hash uses linear chaining...).

    So, I am thinking of implementing it as a hash of hashes... using the first
    letter of these records to divide them into 26 sub-hashes. (Since
    these records' first letters are pretty random from A-Z.)

    So, I tried it, and the performance is about the same... why??

    The difference between what I had and what I am doing now is:

    Before:
    $mainHash{$somekey} = \@values;

    After:
    my $letter = substr $values, 0, 1;
    $mainHash{$letter}{$somekey} = \@values;

    Shouldn't the "after" format help the memory usage / efficiency by
    a lot? Because it lowers the collisions by a factor of 26...???

    Thanks,
    T
    tak, Aug 2, 2006
    #1

  2. Xho

    Xho Guest

    "tak" <> wrote:
    > Hi,
    >
    > I have a script, that loads a txt file, with 240k lines in it to a hash
    > currently. And when it loads the data to the hash - it becomes slower
    > and slower when it reaches maybe around 150k


    How much memory do you have? How much are you using at this point?

    > (probably due to
    > collisions, since Perl's hash uses linear chaining...).


    That is a rather unlikely bit of speculation, especially on a modern Perl.
    How many buckets does your hash have and use? (print scalar %hash).

    > So, i am thinking of implementing it in hash of hash... using the first
    > letter of these records, and divide them into 26 sub hashes. (Since
    > these records' first letter are pretty random from A-Z).
    >
    > So, I tried it, the performance is about the same... why??


    Solving the imagined problem rarely solves the real problem. (And if the
    problem really were that easy to fix, don't you think Perl would
    already have made that fix itself in the core hashing code?)

    When the facts don't fit your theory, re-examine your theory. You probably
    have a swapping problem, not a hash collision problem. And if you do have
    a collision problem, the better way to fix it would be to start out with a
    higher number of buckets, by assigning to the keys function.
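
    For example, a minimal sketch of that pre-allocation (the 300_000 here
    is just a guess; you would tune it to your expected key count):

    my %hash;
    keys(%hash) = 300_000;   # Perl rounds this up to the next power of two
    # ... then load the file into %hash as before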

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    Xho, Aug 2, 2006
    #2

  3. tak

    tak Guest

    Xho wrote:
    > "tak" <> wrote:
    > > Hi,
    > >
    > > I have a script, that loads a txt file, with 240k lines in it to a hash
    > > currently. And when it loads the data to the hash - it becomes slower
    > > and slower when it reaches maybe around 150k

    >
    > How much memory do you have? How much are you using at this point?


    From looking at the PF usage - it is about 1.9 GB, on a 1 GB machine -
    the available physical memory is down to about 10 MB when loading...
    But the CPU usage remains at only about 5%...


    >
    > > (probably due to
    > > collisions, since Perl's hash uses linear chaining...).

    >
    > That is a rather unlikely bit of speculation, especially on a modern Perl.
    > How many buckets does your hash have and use? (print scalar %hash).
    >


    I have 1 main_hash, which stores 27 hashes in it. Each of those 27
    hashes averages about 9k unique strings. print scalar %hash
    reports 23/32. What does this number mean?


    > > So, i am thinking of implementing it in hash of hash... using the first
    > > letter of these records, and divide them into 26 sub hashes. (Since
    > > these records' first letter are pretty random from A-Z).
    > >
    > > So, I tried it, the performance is about the same... why??

    >
    > Solving the imagined problem rarely solves the real problem. (And if that
    > were the problem it was that easy to fix, don't you think Perl would
    > already have made that fix itself in the core hashing code?)
    >


    So, what do you suggest?


    > When the facts don't fit your theory, re-examine your theory. You probably
    > have a swapping problem, not a hash collision problem. And if you do have
    > a collision problem, the better way to fix it would be to start out with a
    > higher number of buckets, by assigning to the keys function.
    >


    Can you elaborate on what you mean by a swapping problem? And I thought
    about assigning a higher number of buckets to the hash itself, but I
    cannot find the related function to set that... I am a Java programmer,
    and this is my first Perl script. I tried looking into the constructor
    for the hash itself, but it doesn't seem to accept arguments...?

    Last question,

    How do you delete an element within a HoH? Say I have a hash of hashes,
    like the following.

    my %hoh();

    loop() { # say this is the loop of each line of my txtFile
    my $value = "TheRecordFromMyTxtFile";
    my $letter = substr $value, 0, 1; # say, i am using the first letter
    as the key for subhash.
    my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
    the element.
    $hoh{$letter}{$myKey} = $value
    }


    Now, I want to delete a particular value from one of the subhashes...

    I tried doing this,

    delete $hoh{$letter}{$value};

    But it doesn't seem like it is deleting... Because if I try to get the
    length of the $hoh{$letter}, it still reports the same number...


    Thanks!
    tak



    > Xho
    >
    > --
    > -------------------- http://NewsReader.Com/ --------------------
    > Usenet Newsgroup Service $9.95/Month 30GB
    tak, Aug 2, 2006
    #3
  4. Paul Lalli

    Paul Lalli Guest

    tak wrote:
    > wrote:
    > > "tak" <> wrote:

    > I have 1 main_hash, which stores 27 hashes in it. And out of each 27
    > hashes, it averages about 9k unique strings. print scalar %hash
    > reports, 23/32. What does this number mean?


    It means that Perl has allocated 32 buckets for this hash, and that 23
    of them are currently in use. So, only 4 collisions in the "main"
    hash.

    > > And if you do have
    > > a collision problem, the better way to fix it would be to start out with a
    > > higher number of buckets, by assigning to the keys function.
    > >

    >
    > And I thought
    > about assigning higher number of bucket to the hash itself , but i
    > cannot find the related function to set that...


    Xho just gave it to you. The `keys` function. If you read the Perl
    documentation on this function, by typing at your console window:
    perldoc -f keys
    you will find:
    ====================
    As an lvalue "keys" allows you to increase the
    number of hash buckets allocated for the given hash.
    This can gain you a measure of efficiency if you
    know the hash is going to get big. (This is similar
    to pre-extending an array by assigning a larger
    number to $#array.) If you say

    keys %hash = 200;

    then %hash will have at least 200 buckets allocated
    for it--256 of them, in fact, since it rounds up to
    the next power of two.
    ====================
    > I am a Java programmer, and this is my first perl script..


    Welcome to the world of Perl. You'll love it, I promise. :)

    > I tried looking into the constructor for the hash itself,


    There is no such thing. Constructors are methods of classes. Hashes
    are native data types, not objects. They are simply declared and used.

    > but it doesnt seem like it accepts argument...?


    I don't know what you were trying to give arguments to. If you mean
    simply the `my` keyword, then you are correct - you cannot pre-allocate
    buckets when you declare the hash. Instead, declare it, and then
    assign buckets, using the keys function as described above.

    > How Do you delete an element within a hoh? Say i have a hash of hash,
    > like the following.
    >
    > my %hoh();
    >
    > loop() { # say this is the loop of each line of my txtFile


    I don't know what this means. This is not Perl code. Please show real
    code whenever possible.

    > my $value = "TheRecordFromMyTxtFile";
    > my $letter = substr $value, 0, 1; # say, i am using the first letter
    > as the key for subhash.
    > my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
    > the element.
    > $hoh{$letter}{$myKey} = $value
    > }
    >
    >
    > Now, I want to delete a particular value from one of the subhash...
    >
    > I tried doing this,
    >
    > delete $hoh{$letter}{$value};


    That is precisely how you delete that particular value from that
    particular "subhash".

    > But it doesnt seem like it is deleting... B/c if I try to get the
    > length of the $hoh{$letter}, it still reports the same number...


    What, exactly, do you mean by "get the length of the $hoh{$letter}"?
    Again, please show real code whenever possible. Did you, by any
    chance, do something like:
    print length $hoh{$letter};
    ?

    That does not, at all, give you what you want. When you use a
    reference in a scalar context, you get a string representing the type
    of reference and its memory address. To see what I mean, try printing
    out:
    print $hoh{$letter};
    You should see something like
    HASH(0x14d410)
    It is this string that you were passing to the length function.
    Obviously, the length of that string isn't going to change simply
    because you removed one of the key/value pairs.
    To determine the "size" (that is, number of keys) of a hash, you again
    use the keys function, this time in scalar context:
    print scalar(keys %{$hoh{$letter}});

    The additional punctuation around $hoh{$letter} is what we call
    "dereferencing" a reference. You can read all about it by typing at
    your console window:
    perldoc perlreftut
    perldoc perllol
    perldoc perldsc
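
    To put that together, a short self-contained sketch (the keys and
    values are made up, purely for illustration):

    my %hoh;
    $hoh{'A'}{'key1'} = 'A....value1';
    $hoh{'A'}{'key2'} = 'A....value2';
    print scalar(keys %{$hoh{'A'}}), "\n";   # prints 2
    delete $hoh{'A'}{'key1'};
    print scalar(keys %{$hoh{'A'}}), "\n";   # prints 1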

    Hope this helps,
    Paul Lalli
    Paul Lalli, Aug 2, 2006
    #4
  5. Xho

    Xho Guest

    "tak" <> wrote:
    > wrote:
    > > "tak" <> wrote:
    > > > Hi,
    > > >
    > > > I have a script, that loads a txt file, with 240k lines in it to a
    > > > hash currently. And when it loads the data to the hash - it becomes
    > > > slower and slower when it reaches maybe around 150k

    > >
    > > How much memory do you have? How much are you using at this point?

    >
    > From looking at the PF usage - it is about 1.9 GB, on a 1 GB machine -
    > the available physical memory is down to about 10 MB when loading...
    > But the CPU usage remains at only about 5%...


    If I read this right, you are using 1.9GB of virtual memory but you only
    have 1.0GB of RAM, and so have swapped out 900MB? And you have only 10MB
    of *free* memory? That means you are probably swapping like crazy, and
    thus the CPU load is low as it spends most of the time waiting for disk.

    > >
    > > > (probably due to
    > > > collisions, since Perl's hash uses linear chaining...).

    > >
    > > That is a rather unlikely bit of speculation, especially on a modern
    > > Perl. How many buckets does your hash have and use? (print scalar
    > > %hash).
    > >

    >
    > I have 1 main_hash, which stores 27 hashes in it. And out of each 27
    > hashes, it averages about 9k unique strings. print scalar %hash
    > reports, 23/32. What does this number mean?


    There are 32 buckets, of which 23 have at least one value. So there is
    either 1 five-way collision, or 4 two-way collisions, or somewhere in
    between. But that doesn't really tell you much. I meant for you to get
    this number on the one-big-hash implementation, before you changed over to
    the hash of hashes.

    > > > So, i am thinking of implementing it in hash of hash... using the
    > > > first letter of these records, and divide them into 26 sub hashes.
    > > > (Since these records' first letter are pretty random from A-Z).
    > > >
    > > > So, I tried it, the performance is about the same... why??

    > >
    > > Solving the imagined problem rarely solves the real problem. (And if
    > > that were the problem it was that easy to fix, don't you think Perl
    > > would already have made that fix itself in the core hashing code?)
    > >

    >
    > So, what do you suggest?


    Identifying the real problem first :)

    Right now, I'd say that it is memory. So you could try to find a more
    memory-efficient way to hold your data. Or make a multi-pass approach.
    Or maybe tie the hash to disk explicitly. Or use a database to hold the
    data. (Or buy more memory)
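
    One sketch of a more memory-efficient layout (assuming your fields
    never contain the separator character): store one joined string per
    record instead of an array reference, and split it back apart only
    when you actually need the fields:

    # one scalar per record, instead of one array plus one scalar per field
    $mainHash{$somekey} = join "\t", @values;

    # later, when a particular record is needed:
    my @values = split /\t/, $mainHash{$somekey};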


    >
    > > When the facts don't fit your theory, re-examine your theory. You
    > > probably have a swapping problem, not a hash collision problem. And if
    > > you do have a collision problem, the better way to fix it would be to
    > > start out with a higher number of buckets, by assigning to the keys
    > > function.
    > >

    >
    > Can you elaborate on what you mean by a swapping problem?


    Your OS has a virtual memory system that lets you use more memory than
    you actually have, by swapping/paging out "unused" memory to disk. But it
    can get very, very slow if you are actively using more pages than fit in
    real memory.

    (Paul answered the rest of your questions)

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    Xho, Aug 2, 2006
    #5
  6. tak

    tak Guest

    Paul Lalli wrote:
    > tak wrote:
    > > wrote:
    > > > "tak" <> wrote:

    > > I have 1 main_hash, which stores 27 hashes in it. And out of each 27
    > > hashes, it averages about 9k unique strings. print scalar %hash
    > > reports, 23/32. What does this number mean?

    >
    > It means that Perl has allocated 32 buckets for this hash, and that 23
    > of them are currently in use. So, only 4 collisions in the "main"
    > hash.


    Why 4 collisions? Do you mean 32 - 23 = 9?? Or because you knew that I
    have 27 subhashes, so 27 - 23?? Without using the 27 hash of hashes, print
    scalar %mainHash reports 16k / 23k. (Of course - it reported the exact
    number, but I didn't take it down, I just remember it's 16k and 23k.) That
    is a huge amount of collision.


    >
    > > > And if you do have
    > > > a collision problem, the better way to fix it would be to start out with a
    > > > higher number of buckets, by assigning to the keys function.
    > > >

    > >
    > > And I thought
    > > about assigning higher number of bucket to the hash itself , but i
    > > cannot find the related function to set that...

    >
    > Xho just gave it to you. The `keys` function. If you read the Perl
    > documentation on this function, by typing at your console window:
    > perldoc -f keys
    > you will find:
    > ====================
    > As an lvalue "keys" allows you to increase the
    > number of hash buckets allocated for the given hash.
    > This can gain you a measure of efficiency if you
    > know the hash is going to get big. (This is similar
    > to pre-extending an array by assigning a larger
    > number to $#array.) If you say
    >
    > keys %hash = 200;
    >
    > then %hash will have at least 200 buckets allocated
    > for it--256 of them, in fact, since it rounds up to
    > the next power of two.
    > ====================
    > > I am a Java programmer, and this is my first perl script..

    >
    > Welcome to the world of Perl. You'll love it, I promise. :)
    >
    > > I tried looking into the constructor for the hash itself,

    >
    > There is no such thing. Constructors are methods of classes. Hashes
    > are native data types, not objects. They are simply declared and used.
    >
    > > but it doesnt seem like it accepts argument...?

    >
    > I don't know what you were trying to give arguments to. If you mean
    > simply the `my` keyword, then you are correct - you cannot pre-allocate
    > buckets when you declare the hash. Instead, declare it, and then
    > assign buckets, using the keys function as described above.



    I tried to do this: keys %main_hash = 300000; but it is still running
    slow when it reaches 150000... perhaps it is not the collision problem,
    as Xho mentioned?



    >
    > > How Do you delete an element within a hoh? Say i have a hash of hash,
    > > like the following.
    > >
    > > my %hoh();
    > >
    > > loop() { # say this is the loop of each line of my txtFile

    >
    > I don't know what this means. This is not Perl code. Please show real
    > code whenever possible.
    >
    > > my $value = "TheRecordFromMyTxtFile";
    > > my $letter = substr $value, 0, 1; # say, i am using the first letter
    > > as the key for subhash.
    > > my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
    > > the element.
    > > $hoh{$letter}{$myKey} = $value
    > > }
    > >
    > >
    > > Now, I want to delete a particular value from one of the subhash...
    > >
    > > I tried doing this,
    > >
    > > delete $hoh{$letter}{$value};

    >
    > That is precisely how you delete that particular value from that
    > particular "subhash".
    >


    Say I have the key of the subhash as $letter, and the item in the
    subhash as $value.

    delete $hoh{$letter}{$value};

    This should delete that from the hash, right?




    > > But it doesnt seem like it is deleting... B/c if I try to get the
    > > length of the $hoh{$letter}, it still reports the same number...

    >
    > What, exactly, do you mean by "get the length of the $hoh{$letter}"?
    > Again, please show real code whenever possible. Did you, by any
    > chance, do something like:
    > print length $hoh{$letter};
    > ?
    >
    > That does not, at all, give you what you want. When you use a
    > reference in a scalar context, you get a string representing the type
    > of reference and its memory address. To see what I mean, try printing
    > out:
    > print $hoh{$letter};
    > You should see something like
    > HASH(0x14d410)
    > It is this string that you were passing to the length function.
    > Obviously, the length of that string isn't going to change simply
    > because you removed one of the key/value pairs.
    > To determine the "size" (that is, number of keys) of a hash, you again
    > use the keys function, this time in scalar context:
    > print scalar(keys %{$hoh{$letter}});
    >
    > The additional punctuation around $hoh{$letter} is what we call
    > "dereferencing" a reference. You can read all about it by typing at
    > your console window:
    > perldoc perlreftut
    > perldoc perllol
    > perldoc perldsc
    >



    Say I want to look at the value in this key --
    $hoh{$letter}{$value} - how do you print it?
    I tried print $hoh{$letter}{$value}; - but it prints nothing....

    Thanks,
    Tak


    > Hope this helps,
    > Paul Lalli
    tak, Aug 2, 2006
    #6
  7. Ben Morrow

    Ben Morrow Guest

    Quoth "tak" <>:
    >
    > wrote:
    > > "tak" <> wrote:
    > > > Hi,
    > > >
    > > > I have a script, that loads a txt file, with 240k lines in it to a hash
    > > > currently. And when it loads the data to the hash - it becomes slower
    > > > and slower when it reaches maybe around 150k

    > >
    > > How much memory do you have? How much are you using at this point?

    >
    > From looking at the PF usage - it is about 1.9 GB, on a 1 GB machine -
    > the available physical memory is down to about 10 MB when loading...
    > But the CPU usage remains at only about 5%...


    So, you are thrashing. You've run out of memory: I would suggest using
    one of the DBM modules, probably DB_File. This stores the contents of
    the hash in a (structured, binary, fast-to-index) file on disk, which
    will probably make things faster.
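
    A minimal sketch of that (the filename is made up; see perldoc DB_File
    for the full set of options):

    use DB_File;
    my %hash;
    tie %hash, 'DB_File', 'records.db'
        or die "Cannot tie records.db: $!";
    # %hash now reads and writes through to the file on disk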

    > > > (probably due to
    > > > collisions, since Perl's hash uses linear chaining...).

    > >
    > > That is a rather unlikely bit of speculation, especially on a modern Perl.
    > > How many buckets does your hash have and use? (print scalar %hash).

    >
    > I have 1 main_hash, which stores 27 hashes in it. And out of each 27
    > hashes, it averages about 9k unique strings. print scalar %hash
    > reports, 23/32. What does this number mean?


    From perldoc perldata:

    | If you evaluate a hash in scalar context, it returns false if the hash
    | is empty. If there are any key/value pairs, it returns true; more
    | precisely, the value returned is a string consisting of the number of
    | used buckets and the number of allocated buckets, separated by a slash.
    | This is pretty much useful only to find out whether Perl's internal
    | hashing algorithm is performing poorly on your data set. For example,
    | you stick 10,000 things in a hash, but evaluating %HASH in scalar
    | context reveals "1/16", which means only one out of sixteen buckets has
    | been touched, and presumably contains all 10,000 of your items. This
    | isn't supposed to happen.

    (Note that this is not meant as a rebuke: no one can be expected to have
    all the arcana in Perl's std docs memorized. It is meant so that you may
    remember where to find it next time :). )

    So, your main hash is using 23 buckets to store your 27 subhashes... not
    such a useful thing to know :). The real question is, how many buckets
    does your original hash (with all the data in it) use? For instance, on
    my perl

    my %h;

    for (1..240_000) {
        $h{$_} = 1;
    }

    print scalar %h;

    prints '157199/262144', so the hash is using 157199 buckets, and each
    bucket has on average 240000/157199 ~~ 1.5 entries in it, which should
    not be a problem.

    > > When the facts don't fit your theory, re-examine your theory. You probably
    > > have a swapping problem, not a hash collision problem. And if you do have
    > > a collision problem, the better way to fix it would be to start out with a
    > > higher number of buckets, by assigning to the keys function.
    > >

    >
    > Can you elaborate on what you mean by a swapping problem?


    Your system has started thrashing: the working set (the pages in current
    use) has exceeded the size of physical memory, and the system is
    spending all its time swapping things in and out.

    > And I thought
    > about assigning higher number of bucket to the hash itself , but i
    > cannot find the related function to set that... I am a Java programmer,
    > and this is my first perl script.. I tried looking into the constructor
    > for the hash itself, but it doesnt seem like it accepts argument...?


    The next para after my previous quote:

    | You can preallocate space for a hash by assigning to the keys()
    | function. This rounds up the allocated buckets to the next power of two:
    |
    | keys(%users) = 1000; # allocate 1024 buckets

    > Last question,
    >
    > How Do you delete an element within a hoh? Say i have a hash of hash,
    > like the following.
    >
    > my %hoh();


    Did you even try this? Perl Is Not Java: this is a syntax error. You
    don't need the parens.

    > loop() { # say this is the loop of each line of my txtFile


    What is this 'loop()'? Have you been reading about Perl6? Or did you
    mean

    sub loop {

    ?

    > my $value = "TheRecordFromMyTxtFile";


    (You really want to sort out your indentation. Makes life easier for
    both you and us.)

    > my $letter = substr $value, 0, 1; # say, i am using the first letter
    > as the key for subhash.
    > my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
    > the element.
    > $hoh{$letter}{$myKey} = $value
    > }
    >
    >
    > Now, I want to delete a particular value from one of the subhash...
    >
    > I tried doing this,
    >
    > delete $hoh{$letter}{$value};


    That's correct (assuming $value corresponds to $myKey in the above, not
    to $value there: that is, you delete an element by specifying its key).

    > But it doesnt seem like it is deleting... B/c if I try to get the
    > length of the $hoh{$letter}, it still reports the same number...


    You really need to learn some basic Perl. I'd recommend a book:
    'Learning Perl' published by O'Reilly is universally recommended as a
    good place to start. An alternative would be to read through the
    perldocs, but that's not an easy way to learn.

    length (see perldoc -f length) treats its argument as a string and
    returns the length of that string. $hoh{$letter} contains a hash
    *reference*: see perldoc perldsc and perldoc perlreftut for how
    multi-level data structures are implemented in Perl. Or, again, a decent
    book will cover it. Now, when you stringify a hash ref, you get
    something that looks like 'HASH(0x80142180)', which is basically
    useless, and is always the same length.

    To find the number of keys in a hash, you do as it says in perldoc -f
    length: 'scalar keys %hash'. This is somewhat complicated by the fact
    that what you have is not a hash but a hash ref, so we apply 'Use Rule
    1' from perlreftut:

    # an ordinary hash
    print scalar keys %hash;

    # replace the var name with { }
    print scalar keys %{ }

    # put the hashref inside the braces
    print scalar keys %{ $hoh{$letter} };

    Yes, I agree this is a little icky, but that's what you get when you
    graft complex data structures onto a language (Perl4) that doesn't
    really support them :).

    A useful tool for examining data structures is the module Data::Dumper
    (obviously, you want to run a test on a smaller dataset rather than
    dumping a hash of 240k entries).
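
    For example, a tiny sketch with made-up data:

    use Data::Dumper;
    my %hoh = ( A => { A1234 => 'some record' } );
    print Dumper(\%hoh);   # prints the whole nested structure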

    Ben

    --
    All persons, living or dead, are entirely coincidental.
    Kurt Vonnegut
    Ben Morrow, Aug 2, 2006
    #7
  8. Xho

    Xho Guest

    "tak" <> wrote:
    >
    > Without using the 27 hash of hashes, print
    > scalar %mainHash reports 16k / 23k. (Of course - it reported the
    > number, but I didnt take them down, just remember its 16k and 23k) That
    > is a huge amount of collision.


    How many things were in the hash at the time you did the
    print scalar %mainHash? (print scalar keys %mainHash). If it had 150K
    things in it at the time, then there are 150K/16K or about 10 entries per
    bucket. Higher than I would expect but not awful.

    > >
    > > I don't know what you were trying to give arguments to. If you mean
    > > simply the `my` keyword, then you are correct - you cannot pre-allocate
    > > buckets when you declare the hash. Instead, declare it, and then
    > > assign buckets, using the keys function as described above.

    >
    > I tried to do this, keys %main_hash = 300000; but it is still running
    > slow when it reaches 150000... perhaps it is not the collision problem,
    > as xho mentioned?


    No, it probably isn't collisions that are the problem. But to be sure (but
    mostly out of curiosity), what did print scalar %main_hash give you after
    you loaded 150000 entries into it with this pre-allocation?

    > > > I tried doing this,
    > > >
    > > > delete $hoh{$letter}{$value};

    > >
    > > That is precisely how you delete that particular value from that
    > > particular "subhash".
    > >

    >
    > Say I have the key of subhash, as $letter, and the item in subhash as,
    > $value.
    >
    > delete $hoh{$letter}{$value};
    >
    > This should delete that from the hash, right?


    Yes.
    ....

    > Say I want to look at the value in this key --
    > $hoh{$letter}{$value}, how do you print it?
    > I tried, print $hoh{$letter}{$value}; - but it prints nothing....


    Um, is this before or after you deleted it?

    Are you using warnings? If so, did you get an uninitialized value
    warning?


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    Xho, Aug 2, 2006
    #8
  9. Paul Lalli

    Paul Lalli Guest

    tak wrote:
    > Paul Lalli wrote:
    > > tak wrote:
    > > > wrote:
    > > > > "tak" <> wrote:
    > > > I have 1 main_hash, which stores 27 hashes in it. And out of each 27
    > > > hashes, it averages about 9k unique strings. print scalar %hash
    > > > reports, 23/32. What does this number mean?

    > >
    > > It means that Perl has allocated 32 buckets for this hash, and that 23
    > > of them are currently in use. So, only 4 collisions in the "main"
    > > hash.

    >
    > Why 4 collision? Do you mean 32 - 23 = 9?? or b/c you knew that i have
    > 27 subhashes, so 27 - 23??


    Yes. You have 27 values in your hash. Your hash is using 23 buckets.
    Therefore, 4 buckets are being used twice.

    > Without using the 27 hash of hashes, print
    > scalar %mainHash reports 16k / 23k. (Of course - it reported the
    > number, but I didnt take them down, just remember its 16k and 23k) That
    > is a huge amount of collision.


    So it would seem.

    > I tried to do this, keys %main_hash = 300000; but it is still running
    > slow when it reaches 150000... perhaps it is not the collision problem,
    > as xho mentioned?


    That would be my guess. Again, it seems to be your theory that is
    faulty. You assumed that the slowness was caused by collisions. The
    facts do not support that theory.

    > > > Now, I want to delete a particular value from one of the subhash...
    > > >
    > > > I tried doing this,
    > > >
    > > > delete $hoh{$letter}{$value};

    > >
    > > That is precisely how you delete that particular value from that
    > > particular "subhash".

    >
    > Say I have the key of subhash, as $letter, and the item in subhash as,
    > $value.
    >
    > delete $hoh{$letter}{$value};


    Your code does not match your description. In the above $letter is a
    key of the *main* hash, and $value is a key in the subhash
    %{$hoh{$letter}}.

    > This should delete that from the hash, right?


    That would delete the key/value pair in the hash %{$hoh{$letter}} which
    has the key $value.

    > Say I want to look at the value in this key --
    > $hoh{$letter}{$value}, how do you print it?
    > I tried, print $hoh{$letter}{$value}; - but it prints nothing....


    Then that key does not exist in that hash, or that key's value in the
    hash is the empty string (or the undefined value). Are you using
    warnings? If so, printing a non-existing element of a hash should give
    you a warning. Please enable them if you are not. Again, your
    description above is not matching your code.

    It's time for you to show some *actual* code, so we can better help
    you. Please post a short-but-complete script that demonstrates one or
    more of the failures you are encountering.

    Also, it would be appreciated if you trimmed your replies down to only
    include the relevant bits of quoted material. Thank you.

    Paul Lalli
    Paul Lalli, Aug 2, 2006
    #9
  10. Paul Lalli

    Paul Lalli Guest

    Ben Morrow wrote:
    > Quoth "tak" <>:
    > > But it doesnt seem like it is deleting... B/c if I try to get the
    > > length of the $hoh{$letter}, it still reports the same number...

    >
    > You really need to learn some basic Perl. I'd recommend a book:
    > 'Learning Perl' published by O'Reilly is universally recommended as a
    > good place to start.


    > $hoh{$letter} contains a hash
    > *reference*: see perldoc perldsc and perldoc perlreftut for how
    > multi-level data structures are implemented in Perl. Or, again, a decent
    > book will cover it.


    Yet, ironically, not the book you recommended. :-( Once the OP has
    finished with "Learning Perl", he should probably move on to
    "Intermediate Perl", which does cover multi-level data structures.

    Paul Lalli
    Paul Lalli, Aug 2, 2006
    #10
  11. Ben Morrow

    Ben Morrow Guest

    Quoth "tak" <>:
    >
    > Paul Lalli wrote:
    > > tak wrote:
    > > > wrote:
    > > > > "tak" <> wrote:
    > > > I have 1 main_hash, which stores 27 hashes in it. And out of each 27
    > > > hashes, it averages about 9k unique strings. print scalar %hash
    > > > reports, 23/32. What does this number mean?

    > >
    > > It means that Perl has allocated 32 buckets for this hash, and that 23
    > > of them are currently in use. So, only 4 collisions in the "main"
    > > hash.

    >
    > Why 4 collision? Do you mean 32 - 23 = 9?? or b/c you knew that i have
    > 27 subhashes, so 27 - 23??


    The latter. Say you had a hash with 5 entries, and scalar %hash printed
    2/4 (it wouldn't actually, of course, perl would use 5 buckets, but say
    it did). Then you have four buckets, and two of them contain your five
    entries, perhaps like

    | 5 |   |   |   |
    | 3 | 4 |   |   |
    | 1 | 2 |   |   |
    +---+---+---+---+

    So on average perl has to search through 5/2 = 2.5 entries once it has
    located the right bucket (assuming the hashing function is fair, which
    in general perl's is). Perl will tell you this if you ask nicely:

    print( (scalar keys %hash) / (scalar %hash) );

    This will give a warning about converting '23/32' to a number (you are
    using warnings, right?), but you can ignore that if it's just for
    debugging.

    > Without using the 27 hash of hashes, print
    > scalar %mainHash reports 16k / 23k. (Of course - it reported the
    > number, but I didnt take them down, just remember its 16k and 23k) That
    > is a huge amount of collision.


    This is your hash with 240k entries in it? That's still only an average
    of 15 elements per bucket. Still, you could try preallocating and see
    if it makes a difference.

    > > I don't know what you were trying to give arguments to. If you mean
    > > simply the `my` keyword, then you are correct - you cannot pre-allocate
    > > buckets when you declare the hash. Instead, declare it, and then
    > > assign buckets, using the keys function as described above.

    >
    > I tried to do this, keys %main_hash = 300000;


    There's little point in creating more buckets than you will have
    elements. That's just a waste of space. Note that Perl allows you to
    write '300000' as '300_000', which is much more readable.

    > but it is still running
    > slow when it reaches 150000... perhaps it is not the collision problem,
    > as xho mentioned?


    Indeed.

    > > > Now, I want to delete a particular value from one of the subhash...
    > > >
    > > > I tried doing this,
    > > >
    > > > delete $hoh{$letter}{$value};

    > >
    > > That is precisely how you delete that particular value from that
    > > particular "subhash".

    >
    > Say I have the key of subhash, as $letter, and the item in subhash as,
    > $value.
    >
    > delete $hoh{$letter}{$value};
    >
    > This should delete that from the hash, right?

    <snippage>

    > Say I want to look at the value in this key --
    > $hoh{$letter}{$value}, how do you print it?
    > I tried, print $hoh{$letter}{$value}; - but it prints nothing....


    Both those examples are correct, as far as I can tell. I think you have
    some confusion somewhere else (and I'm still a little worried by your use
    of '$value' to refer to a key: are you confusing keys and values?). Can
    you show us a short complete script, with (a short set of) input data,
    that doesn't do what you expect? ('Short' in this case probably means
    <15 lines.)

    Ben

    --
    "If a book is worth reading when you are six, *
    it is worth reading when you are sixty." [C.S.Lewis]
    Ben Morrow, Aug 2, 2006
    #11
  12. Ben Morrow

    Ben Morrow Guest

    Quoth "Paul Lalli" <>:
    > Ben Morrow wrote:
    > > Quoth "tak" <>:
    > > > But it doesnt seem like it is deleting... B/c if I try to get the
    > > > length of the $hoh{$letter}, it still reports the same number...

    > >
    > > You really need to learn some basic Perl. I'd recommend a book:
    > > 'Learning Perl' published by O'Reilly is universally recommended as a
    > > good place to start.

    >
    > > $hoh{$letter} contains a hash
    > > *reference*: see perldoc perldsc and perldoc perlreftut for how
    > > multi-level data structures are implemented in Perl. Or, again, a decent
    > > book will cover it.

    >
    > Yet, ironically, not the book you recommended. :-(


    Ah, my apologies. I've never actually read either the Llama or the
    Alpaca; I learned Perl from the Camel, here, and perl.plover.com :) .

    Ben

    --
    If I were a butterfly I'd live for a day, / I would be free, just blowing away.
    This cruel country has driven me down / Teased me and lied, teased me and lied.
    I've only sad stories to tell to this town: / My dreams have withered and died.
    (Kate Rusby)
    Ben Morrow, Aug 2, 2006
    #12
  13. tak

    tak Guest

    > > Without using the 27 hash of hashes, print
    > > scalar %mainHash reports 16k / 23k. (Of course - it reported the
    > > number, but I didnt take them down, just remember its 16k and 23k) That
    > > is a huge amount of collision.

    >
    > So it would seem.
    >


    Sorry - I had a typo - I meant 160k and 230k... not 16k and 23k.


    > It's time for you to show some *actual* code, so we can better help
    > you. Please post a short-but-complete script that demonstrates one or
    > more of the failures you are encountering.
    >


    I will put one up in 10 minutes.

    > Also, it would be appreciated if you trimmed your replies down to only
    > include the relevant bits of quoted material. Thank you.
    >


    Ok...Sorry.
    tak, Aug 2, 2006
    #13
  14. tak

    tak Guest

    > How many things were in the hash at the time you did the
    > print scalar %mainHash? (print scalar keys %mainHash). If it had 150K
    > things in it at the time, then there are 150K/16K or about 10 entries per
    > bucket. Higher than I would expect but not awful.
    >


    At the time I printed out scalar keys %mainHash - I had 240k entries in
    it. (This is the one-big-hash implementation). And print scalar
    %mainHash reports 160k/230k - not 16k and 23k... I missed out a 0... so
    that means it's about 1 entry per slot -- not too much collision?

    I printed out the sizes of %mainHash and of each of the subhashes
    AFTER all 240k entries were inserted. This is a cut and paste:

    238348 lines in total.
    Item: @ - Size: 9 - ScalarSize: 7/16
    Item: A - Size: 12750 - ScalarSize: 8859/16384
    Item: B - Size: 9309 - ScalarSize: 7084/16384
    Item: C - Size: 14898 - ScalarSize: 9854/16384
    Item: D - Size: 8878 - ScalarSize: 6865/16384
    Item: E - Size: 7341 - ScalarSize: 4904/8192
    Item: F - Size: 7557 - ScalarSize: 4954/8192
    Item: G - Size: 7109 - ScalarSize: 4733/8192
    Item: H - Size: 7399 - ScalarSize: 4854/8192
    Item: I - Size: 12290 - ScalarSize: 8639/16384
    Item: J - Size: 3898 - ScalarSize: 2516/4096
    Item: K - Size: 4596 - ScalarSize: 3494/8192
    Item: L - Size: 8549 - ScalarSize: 6649/16384
    Item: M - Size: 10149 - ScalarSize: 7598/16384
    Item: N - Size: 7725 - ScalarSize: 5044/8192
    Item: O - Size: 8791 - ScalarSize: 6818/16384
    Item: P - Size: 11573 - ScalarSize: 8262/16384
    Item: Q - Size: 12397 - ScalarSize: 8721/16384
    Item: R - Size: 8247 - ScalarSize: 6464/16384
    Item: S - Size: 13921 - ScalarSize: 9348/16384
    Item: T - Size: 8995 - ScalarSize: 6884/16384
    Item: U - Size: 10667 - ScalarSize: 7799/16384
    Item: V - Size: 9166 - ScalarSize: 7058/16384
    Item: W - Size: 11293 - ScalarSize: 8142/16384
    Item: X - Size: 6617 - ScalarSize: 4555/8192
    Item: Y - Size: 8704 - ScalarSize: 6717/16384
    Item: Z - Size: 4167 - ScalarSize: 3277/8192
    scalar of master_hash is: 27/524288

    This data is being printed by the following code:

    for my $item ( sort keys %mainHash ) {
        my $size       = keys(%{$mainHash{$item}});
        my $scalarSize = scalar %{$mainHash{$item}};
        print "Item: $item - Size: $size - ScalarSize: $scalarSize\n";
    }
    print "scalar of master_hash is: ";
    print scalar %mainHash;
    print "\n";

    By looking at the scalarSize and my PF usage - it does seem like it is
    a memory issue, from loading more data than the system can handle?
    In situations like that - does it matter if I do keys(%mainHash) = 300000
    or not?



    > Are you using warnings? If so, did you get an uninitialized value
    > warning?
    >


    Perhaps not - how do you turn on warnings?

    Thanks,
    Tak
    tak, Aug 2, 2006
    #14
  15. Paul Lalli

    Paul Lalli Guest

    Ben Morrow wrote:
    > Ah, my apologies. I've never actually read either the Llama or the
    > Alpaca; I learned Perl from the Camel, here, and perl.plover.com :) .


    They're both great books, but the Llama is, unfortunately, *very*
    beginner-oriented. It does not cover references, multi-dimensional
    structures, or even passing arrays or hashes to subroutines (because,
    obviously, that involves references).

    Paul Lalli
    Paul Lalli, Aug 2, 2006
    #15
  16. Ben Morrow

    Ben Morrow Guest

    Quoth "tak" <>:
    > > How many things were in the hash at the time you did the
    > > print scalar %mainHash? (print scalar keys %mainHash). If it had 150K
    > > things in it at the time, then there are 150K/16K or about 10 entries per
    > > bucket. Higher than I would expect but not awful.

    >
    > At the time I printed out scalar keys %mainHash - I had 240k entries in
    > it. (This is the one-big-hash implementation). And print scalar
    > %mainHash reports 160k/230k - not 16k and 23k... I missed out a 0... so
    > that means it's about 1 entry per slot -- not too much collision?


    That's more like what I would have expected (with no basis whatever
    other than faith in perl's hashing algorithm :)). Exactly one entry per
    slot is perfect hashing: no searching at all within the bucket. Not much
    more than that means not much searching.

    > By looking at the scalarSize and my PF usage - it does seem like it is
    > a memory issue, from loading more data than the system can handle?
    > In situations like that - does it matter if I do keys(%mainHash) = 300000
    > or not?


    Well... preallocating too much will only make things worse.
    Preallocating at all is a dodgy business (you have to calculate how much
    to preallocate, which makes the code more complicated and thus more
    fragile; and less readable), so don't do it if bucket allocation isn't
    your problem, which it seems not to be.

    > > Are you using warnings? If so, did you get an uninitialized value
    > > warning?

    >
    > Perhaps not - how do you turn on warnings?


    Have you read the Posting Guidelines? They cover some basic stuff like
    this. At the top of every script you write you should have

    use strict;
    use warnings;

    which turn on a whole lot of diagnostics which are usually essential but
    often annoying for one-liners.

    Ben

    --
    The cosmos, at best, is like a rubbish heap scattered at random.
    Heraclitus
    Ben Morrow, Aug 2, 2006
    #16
  17. Xho

    Xho Guest

    "tak" <> wrote:
    > > How many things were in the hash at the time you did the
    > > print scalar %mainHash? (print scalar keys %mainHash). If it had 150K
    > > things in it at the time, then there are 150K/16K or about 10 entries
    > > per bucket. Higher than I would expect but not awful.
    > >

    >
    > At the time I printed out scalar keys %mainHash - I had 240k entries in
    > it. (This is the one-big-hash implementation). And print scalar
    > %mainHash reports 160k/230k - not 16k and 23k... I missed out a 0... so
    > that means it's about 1 entry per slot -- not too much collision?


    Right, collisions are not your problem.

    > I printed out the scalar keys %mainHash, and each of the subhashes
    > AFTER all 240k entries were inserted. This is a cut and paste..
    >
    > 238348 lines in total..
    > Item: @ - Size: 9 - ScalarSize: 7/16
    > Item: A - Size: 12750 - ScalarSize: 8859/16384

    ....
    > Item: Z - Size: 4167 - ScalarSize: 3277/8192
    > scalar of master_hash is: 27/524288


    So it looks like you preallocated the master hash to have 300_000 buckets
    (which got rounded up to the next power of two, 524288). There is
    certainly no benefit in doing that, as the master hash only holds 27
    things. Of course this is irrelevant as we now know there is no reason to
    use the two-level hash in the first place!


    > By looking at the scalarSize and my PF usage - it does seem like it is
    > a memory issue, from loading more data than the system can handle?


    Yes. Either loading too much data, or storing too much, or storing it in
    an inefficient way.

    > In situations like that - does it matter if I do keys(%mainHash) = 300000
    > or not?


    It can only make things worse. You are creating 524,261 buckets (all but
    27 of the 524,288 allocated) that take up some amount of memory but which
    will never be used. If you revert to the single hash implementation, then
    pre-allocating that giant hash will not hurt. It might help, but probably
    not by a meaningful amount. Preallocating hashes (or arrays) almost
    *never* helps in a meaningful way. I've only seen it help twice in
    non-toy cases, and both of those were on older versions of Perl--one with
    a bad hashing function that degenerated when given keys which were all
    multiples of 3, and the other was a bug/malfeature in the garbage
    collector.

    >
    > > Are you using warnings? If so, did you get an uninitialized value
    > > warning?
    > >

    >
    > Perhaps not - how do you turn on warnings?


    The preferred way is to "use warnings;". On older Perls, you specify
    -w on the command line or the shebang line.

    #!/usr/bin/perl -w

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    Xho, Aug 3, 2006
    #17
  18. Anno

    Anno Guest

    <> wrote in comp.lang.perl.misc:
    > "tak" <> wrote:


    [...]

    > > Say I have the key of subhash, as $letter, and the item in subhash as,
    > > $value.
    > >
    > > delete $hoh{$letter}{$value};
    > >
    > > This should delete that from the hash, right?

    >
    > Yes.
    > ...


    I find the question too vague to be answered unconditionally.

    It deletes the entry associated with the key $value in the anonymous
    hash $hoh{$letter}. It does not delete anything from %hoh, even if
    the entry $value was the last one in $hoh{$letter}. That may or may
    not matter for the program, but in another part of the thread the OP
    seemed to expect the outer entry to go away. It doesn't without
    help from the programmer.
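
    A sketch of that help, if you do want empty subhashes to go away
    (using a hypothetical $key for the subhash key):

    delete $hoh{$letter}{$key};
    # drop the outer entry once its subhash has no entries left
    delete $hoh{$letter} unless keys %{ $hoh{$letter} };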

    Anno
    Anno, Aug 4, 2006
    #18
  19. Anno

    Anno Guest

    Ben Morrow <> wrote in comp.lang.perl.misc:
    >
    > Quoth "tak" <>:
    > >
    > > wrote:
    > > > "tak" <> wrote:
    > > > > Hi,
    > > > >
    > > > > I have a script, that loads a txt file, with 240k lines in it to a hash
    > > > > currently. And when it loads the data to the hash - it becomes slower
    > > > > and slower when it reaches maybe around 150k
    > > >
    > > > How much memory do you have? How much are you using at this point?

    > >
    > > From looking at the PF usage - it is about 1.9 GB, on a 1 GB machine -
    > > the available physical memory is down to about 10 MB when loading...
    > > But the CPU usage remains at only about 5%...

    >
    > So, you are thrashing. You've run out of memory: I would suggest using
    > one of the DBM modules, probably DB_File. This stores the contents of
    > the hash in a (structured, binary, fast-to-index) file on disk, which
    > will probably make things faster.


    But should it grow to almost two GB in the first place? We have no
    idea how long the OP's lines are, but assuming something ordinary like
    80 bytes, that's a bit over 18 MB. Even Perl shouldn't blow it up
    that much. Or is some other memory hog running? It sounds like the
    program isn't doing much except reading the data. That shouldn't
    bring a two-gig machine down, unless the lines are much longer
    (about 8 KB), of course.

    Anno
    Anno, Aug 4, 2006
    #19
  20. Ilya Zakharevich

    Ilya Zakharevich Guest

    [A complimentary Cc of this posting was sent to
    <-berlin.de>], who wrote in article <>:
    > > > > > I have a script, that loads a txt file, with 240k lines in it to a hash


    > > So, you are thrashing. You've run out of memory: I would suggest using
    > > one of the DBM modules, probably DB_File. This stores the contents of
    > > the hash in a (structured, binary, fast-to-index) file on disk, which
    > > will probably make things faster.


    > But should it grow to almost two GB in the first place? We have no
    > idea how long the OPs lines are, but assuming something ordinary like
    > 80 bytes that's a bit over 18 MB. Even Perl shouldn't blow it up
    > that much. Or is some other memory hog running? It sounds like the
    > program isn't doing much except reading the data. That shouldn't
    > bring a two-gig machine down, unless the lines are much longer
    > (about 8 KB), of course.


    Note that hashes are quite memory-hungry (I would say circa 80 bytes
    per key total overhead - with a very good malloc() implementation).
    So with a VERY LOUSY malloc() implementation (e.g., one which has a 4K
    minimal allocation unit), it could allocate two 4K chunks per key -
    one for the key, another for the value. That is exactly the 8 KB of
    your calculation.

    And I actually SAW such malloc()s widely used - some kind of debugging
    tool which would put all chunks on separate pages (so that bounds
    checking is simplified). Of course, we have no way to know whether this
    is what is used by the OP...
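
    One way to check the Perl-level cost directly (a sketch; Devel::Size
    is a CPAN module, not part of the core):

    use Devel::Size qw(total_size);
    print total_size(\%mainHash), " bytes\n";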

    Hope this helps,
    Ilya
    Ilya Zakharevich, Aug 4, 2006
    #20
