Perl hash of hash efficiency.

tak

Hi,

I have a script that loads a txt file with 240k lines in it into a hash.
As it loads the data, it becomes slower and slower once it reaches maybe
150k lines (probably due to collisions, since Perl's hash uses linear
chaining...).

So I am thinking of implementing it as a hash of hashes, using the first
letter of each record to divide them into 26 sub-hashes. (The records'
first letters are fairly evenly spread across A-Z.)

So I tried it, and the performance is about the same... why??

The difference between what I had and what I am doing now is:

Before:
$mainHash{$somekey} = \@values;

After:
my $letter = substr $values, 0, 1;
$mainHash{$letter}{$somekey} = \@values;

Shouldn't the "after" format help the memory usage / efficiency by a
lot, since it lowers the collisions by a factor of 26...???

Thanks,
T
 
xhoster

tak said:
Hi,

I have a script that loads a txt file with 240k lines in it into a hash.
As it loads the data, it becomes slower and slower once it reaches maybe
150k lines

How much memory do you have? How much are you using at this point?
(probably due to
collisions, since Perl's hash uses linear chaining...).

That is a rather unlikely bit of speculation, especially on a modern Perl.
How many buckets does your hash have and use? (print scalar %hash).
So I am thinking of implementing it as a hash of hashes, using the first
letter of each record to divide them into 26 sub-hashes. (The records'
first letters are fairly evenly spread across A-Z.)

So I tried it, and the performance is about the same... why??

Solving the imagined problem rarely solves the real problem. (And if that
were the problem and it were that easy to fix, don't you think Perl would
already have made the fix in the core hashing code?)

When the facts don't fit your theory, re-examine your theory. You probably
have a swapping problem, not a hash collision problem. And if you do have
a collision problem, the better way to fix it would be to start out with a
higher number of buckets, by assigning to the keys function.

Xho
 
tak

How much memory do you have? How much are you using at this point?
From looking at the PF Usage, it is about 1.9 GB on a 1 GB machine;
the available physical memory is down to about 10 MB when loading...
But the CPU usage remains at about 5% only...

That is a rather unlikely bit of speculation, especially on a modern Perl.
How many buckets does your hash have and use? (print scalar %hash).

I have 1 main_hash, which stores 27 hashes in it. And each of the 27
hashes averages about 9k unique strings. print scalar %hash
reports 23/32. What does this number mean?

Solving the imagined problem rarely solves the real problem. (And if that
were the problem and it were that easy to fix, don't you think Perl would
already have made the fix in the core hashing code?)

So, what do you suggest?

When the facts don't fit your theory, re-examine your theory. You probably
have a swapping problem, not a hash collision problem. And if you do have
a collision problem, the better way to fix it would be to start out with a
higher number of buckets, by assigning to the keys function.

Can you elaborate on what you mean by a swapping problem? And I thought
about assigning a higher number of buckets to the hash itself, but I
cannot find the related function to set that... I am a Java programmer,
and this is my first Perl script. I tried looking into the constructor
for the hash itself, but it doesn't seem to accept arguments...?

Last question:

How do you delete an element within a HoH? Say I have a hash of hashes,
like the following.

my %hoh();

loop() { # say this is the loop of each line of my txtFile
my $value = "TheRecordFromMyTxtFile";
my $letter = substr $value, 0, 1; # say, i am using the first letter
as the key for subhash.
my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
the element.
$hoh{$letter}{$myKey} = $value
}


Now, I want to delete a particular value from one of the subhashes...

I tried doing this,

delete $hoh{$letter}{$value};

But it doesn't seem to be deleting... Because if I try to get the
length of $hoh{$letter}, it still reports the same number...


Thanks!
tak
 
Paul Lalli

tak said:
I have 1 main_hash, which stores 27 hashes in it. And each of the 27
hashes averages about 9k unique strings. print scalar %hash
reports 23/32. What does this number mean?

It means that Perl has allocated 32 buckets for this hash, and that 23
of them are currently in use. So, only 4 collisions in the "main"
hash.
And I thought
about assigning a higher number of buckets to the hash itself, but I
cannot find the related function to set that...

Xho just gave it to you. The `keys` function. If you read the Perl
documentation on this function, by typing at your console window:
perldoc -f keys
you will find:
====================
As an lvalue "keys" allows you to increase the
number of hash buckets allocated for the given hash.
This can gain you a measure of efficiency if you
know the hash is going to get big. (This is similar
to pre-extending an array by assigning a larger
number to $#array.) If you say

keys %hash = 200;

then %hash will have at least 200 buckets allocated
for it--256 of them, in fact, since it rounds up to
the next power of two.
====================
I am a Java programmer, and this is my first Perl script.

Welcome to the world of Perl. You'll love it, I promise. :)
I tried looking into the constructor for the hash itself,

There is no such thing. Constructors are methods of classes. Hashes
are native data types, not objects. They are simply declared and used.
but it doesn't seem to accept arguments...?

I don't know what you were trying to give arguments to. If you mean
simply the `my` keyword, then you are correct - you cannot pre-allocate
buckets when you declare the hash. Instead, declare it, and then
assign buckets using the keys function as described above.
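
Concretely, that looks something like this (a minimal sketch; the
300_000 figure is just an example sized to the 240k records discussed
in this thread):

my %hash;                  # plain declaration; hashes have no constructor
keys(%hash) = 300_000;     # preallocate buckets; rounds up to 524288 (2**19)

$hash{sample} = 1;
print scalar(%hash), "\n"; # on Perls of that era, prints e.g. "1/524288"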
How do you delete an element within a HoH? Say I have a hash of hashes,
like the following.

my %hoh();

loop() { # say this is the loop of each line of my txtFile

I don't know what this means. This is not Perl code. Please show real
code whenever possible.
my $value = "TheRecordFromMyTxtFile";
my $letter = substr $value, 0, 1; # say, i am using the first letter
as the key for subhash.
my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
the element.
$hoh{$letter}{$myKey} = $value
}


Now, I want to delete a particular value from one of the subhashes...

I tried doing this,

delete $hoh{$letter}{$value};

That is precisely how you delete that particular value from that
particular "subhash".
But it doesn't seem to be deleting... Because if I try to get the
length of $hoh{$letter}, it still reports the same number...

What, exactly, do you mean by "get the length of the $hoh{$letter}"?
Again, please show real code whenever possible. Did you, by any
chance, do something like:
print length $hoh{$letter};
?

That does not, at all, give you what you want. When you use a
reference in a string context, you get a string representing the type
of reference and its memory address. To see what I mean, try printing
out:
print $hoh{$letter};
You should see something like
HASH(0x14d410)
It is this string that you were passing to the length function.
Obviously, the length of that string isn't going to change simply
because you removed one of the key/value pairs.
To determine the "size" (that is, number of keys) of a hash, you again
use the keys function, this time in scalar context:
print scalar(keys %{$hoh{$letter}});

The additional punctuation around $hoh{$letter} is what we call
"dereferencing" a reference. You can read all about it by typing at
your console window:
perldoc perlreftut
perldoc perllol
perldoc perldsc

Hope this helps,
Paul Lalli
 
xhoster

tak said:
the available physical memory are down to about 10 MB when loading...
But the CPU usage remains about 5% only...

If I read this right, you are using 1.9GB of virtual memory but you only
have 1.0GB of RAM, and so have swapped out 900MB? And you have only 10MB
of *free* memory? That means you are probably swapping like crazy, and
thus the CPU load is low as it spends most of the time waiting for disk.
I have 1 main_hash, which stores 27 hashes in it. And each of the 27
hashes averages about 9k unique strings. print scalar %hash
reports 23/32. What does this number mean?

There are 32 buckets, of which 23 have at least one value. So there is
either 1 five-way collision, or 4 two-way collisions, or somewhere in
between. But that doesn't really tell you much. I meant for you to get
this number on the one-big-hash implementation, before you changed over to
the hash of hashes.
So, what do you suggest?

Identifying the real problem first :)

Right now, I'd say that it is memory. So you could try to find a more
memory-efficient way to hold your data. Or take a multi-pass approach.
Or tie the hash to disk explicitly. Or use a database to hold the
data. (Or buy more memory.)
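
For instance, a sketch of the "more memory efficient way" (assuming no
field ever contains the separator character): store one joined string
per record instead of an array reference, which drops the per-record
array and per-field scalar overhead.

my %mainHash;
my $somekey = 'key1';
my @values  = ('field1', 'field2', 'field3');

# instead of: $mainHash{$somekey} = \@values;  (an array plus a scalar per field)
$mainHash{$somekey} = join "\t", @values;      # one flat string per record

# rebuild the fields only when a record is actually needed
my @fields = split /\t/, $mainHash{$somekey};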

Can you elaborate on what you mean by a swapping problem?

Your OS has a virtual memory system that lets you use more memory than
you actually have, by swapping/paging out "unused" memory to disk. But it
can get very, very slow if you are actively using more pages than fit in
real memory.

(Paul answered the rest of your questions)

Xho
 
tak

Paul said:
It means that Perl has allocated 32 buckets for this hash, and that 23
of them are currently in use. So, only 4 collisions in the "main"
hash.

Why 4 collisions? Do you mean 32 - 23 = 9?? Or because you knew that I have
27 subhashes, so 27 - 23?? Without using the 27 hash of hashes, print
scalar %mainHash reports 16k/23k. (Of course, it reported the exact
numbers, but I didn't write them down; I just remember 16k and 23k.) That
is a huge amount of collision.

Xho just gave it to you. The `keys` function. [...] Declare the hash,
and then assign buckets using the keys function as described above.

I tried to do this: keys %main_hash = 300000; but it still runs
slow when it reaches 150000... So perhaps it is not a collision problem,
as Xho suggested?


That is precisely how you delete that particular value from that
particular "subhash".

Say I have the key of subhash, as $letter, and the item in subhash as,
$value.

delete $hoh{$letter}{$value};

This should delete that from the hash, right?

Say I want to look at the value of this key --
$hoh{$letter}{$value} -- how do you print it?
I tried print $hoh{$letter}{$value}; but it prints nothing...

Thanks,
Tak
 
Ben Morrow

Quoth tak:
the available physical memory is down to about 10 MB when loading...
But the CPU usage remains at about 5% only...

So, you are thrashing. You've run out of memory: I would suggest using
one of the DBM modules, probably DB_File. This stores the contents of
the hash in a (structured, binary, fast-to-index) file on disk, which
will probably make things faster.
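
A minimal sketch of that suggestion (the filename and the tab separator
are illustrative assumptions, not anything from this thread); note that
a DBM file stores flat strings, so array values have to be serialized:

use strict;
use warnings;
use Fcntl;      # O_CREAT, O_RDWR
use DB_File;

my %records;
tie %records, 'DB_File', 'records.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "cannot tie records.db: $!";

$records{somekey} = join "\t", 'field1', 'field2';   # flat string, not \@values
my @fields = split /\t/, $records{somekey};

untie %records;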
I have 1 main_hash, which stores 27 hashes in it. And each of the 27
hashes averages about 9k unique strings. print scalar %hash
reports 23/32. What does this number mean?

From perldoc perldata:

| If you evaluate a hash in scalar context, it returns false if the hash
| is empty. If there are any key/value pairs, it returns true; more
| precisely, the value returned is a string consisting of the number of
| used buckets and the number of allocated buckets, separated by a slash.
| This is pretty much useful only to find out whether Perl's internal
| hashing algorithm is performing poorly on your data set. For example,
| you stick 10,000 things in a hash, but evaluating %HASH in scalar
| context reveals "1/16", which means only one out of sixteen buckets has
| been touched, and presumably contains all 10,000 of your items. This
| isn't supposed to happen.

(Note that this is not meant as a rebuke: no one can be expected to have
all the arcana in Perl's std docs memorized. It is meant so that you may
remember where to find it next time :). )

So, your main hash is using 23 buckets to store your 27 subhashes... not
such a useful thing to know :). The real question is, how many buckets
does your original hash (with all the data in it) use? For instance, on
my perl

my %h;

for (1..240_000) {
    $h{$_} = 1;
}

print scalar %h;

prints '157199/262144', so the hash is using 157199 buckets, and each
bucket has on average 240000/157199 ~~ 1.5 entries in it, which should
not be a problem.
Can you elaborate on what you mean by a swapping problem?

Your system has started thrashing: the working set (the pages in current
use) has exceeded the size of physical memory, and the system is
spending all its time swapping things in and out.
And I thought
about assigning a higher number of buckets to the hash itself, but I
cannot find the related function to set that... I am a Java programmer,
and this is my first Perl script. I tried looking into the constructor
for the hash itself, but it doesn't seem to accept arguments...?

The next para after my previous quote:

| You can preallocate space for a hash by assigning to the keys()
| function. This rounds up the allocated buckets to the next power of two:
|
| keys(%users) = 1000; # allocate 1024 buckets
Last question:

How do you delete an element within a HoH? Say I have a hash of hashes,
like the following.

my %hoh();

Did you even try this? Perl Is Not Java: this is a syntax error. You
don't need the parens.
loop() { # say this is the loop of each line of my txtFile

What is this 'loop()'? Have you been reading about Perl6? Or did you
mean

sub loop {

?
my $value = "TheRecordFromMyTxtFile";

(You really want to sort out your indentation. Makes life easier for
both you and us.)
my $letter = substr $value, 0, 1; # say, i am using the first letter
as the key for subhash.
my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
the element.
$hoh{$letter}{$myKey} = $value
}


Now, I want to delete a particular value from one of the subhashes...

I tried doing this,

delete $hoh{$letter}{$value};

That's correct (assuming $value corresponds to $myKey in the above, not
to $value there: that is, you delete an element by specifying its key).
But it doesn't seem to be deleting... Because if I try to get the
length of $hoh{$letter}, it still reports the same number...

You really need to learn some basic Perl. I'd recommend a book:
'Learning Perl' published by O'Reilly is universally recommended as a
good place to start. An alternative would be to read through the
perldocs, but that's not an easy way to learn.

length (see perldoc -f length) treats its argument as a string and
returns the length of that string. $hoh{$letter} contains a hash
*reference*: see perldoc perldsc and perldoc perlreftut for how
multi-level data structures are implemented in Perl. Or, again, a decent
book will cover it. Now, when you stringify a hash ref, you get
something that looks like 'HASH(0x80142180)', which is basically
useless, and is always the same length.

To find the number of keys in a hash, you do as it says in perldoc -f
length: 'scalar keys %hash'. This is somewhat complicated by the fact
that what you have is not a hash but a hash ref, so we apply 'Use Rule
1' from perlreftut:

# an ordinary hash
print scalar keys %hash;

# replace the var name with { }
print scalar keys %{ }

# put the hashref inside the braces
print scalar keys %{ $hoh{$letter} };

Yes, I agree this is a little icky, but that's what you get when you
graft complex data structures onto a language (Perl 4) that didn't
really support them :).

A useful tool for examining data structures is the module Data::Dumper
(obviously, you want to run a test on a smaller dataset rather than
dumping a hash of 240k entries).
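
For example, a small sketch (the sample keys and values here are made
up):

use strict;
use warnings;
use Data::Dumper;

my %hoh = (
    A => { ABC => 'a record starting with A' },
    B => { BCD => 'a record starting with B' },
);

print Dumper(\%hoh);   # pass a reference so Dumper shows the nesting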

Ben
 
xhoster

tak said:
Without using the 27 hash of hashes, print
scalar %mainHash reports 16k/23k. (Of course, it reported the exact
numbers, but I didn't write them down; I just remember 16k and 23k.) That
is a huge amount of collision.

How many things were in the hash at the time you did the
print scalar %mainHash? (print scalar keys %mainHash). If it had 150K
things in it at the time, then there are 150K/16K, or about 10 entries per
bucket. Higher than I would expect, but not awful.
I tried to do this, keys %main_hash = 300000; but it is still running
slow when it reaches 150000... perhaps it is not the collision problem,
as xho mentioned?

No, it probably isn't collisions that are the problem. But to be sure (and
mostly out of curiosity), what did print scalar %main_hash give you after
you loaded 150000 entries into it with this pre-allocation?
Say I have the key of subhash, as $letter, and the item in subhash as,
$value.

delete $hoh{$letter}{$value};

This should delete that from the hash, right?

Yes.
....

Say I want to look at the value of this key --
$hoh{$letter}{$value} -- how do you print it?
I tried print $hoh{$letter}{$value}; but it prints nothing...

Um, is this before or after you deleted it?

Are you using warnings? If so, did you get an uninitialized value
warning?


Xho
 
Paul Lalli

tak said:
Why 4 collisions? Do you mean 32 - 23 = 9?? Or because you knew that I have
27 subhashes, so 27 - 23??

Yes. You have 27 values in your hash. Your hash is using 23 buckets.
Therefore, 4 buckets are being used twice.
Without using the 27 hash of hashes, print
scalar %mainHash reports 16k/23k. (Of course, it reported the exact
numbers, but I didn't write them down; I just remember 16k and 23k.) That
is a huge amount of collision.

So it would seem.
I tried to do this: keys %main_hash = 300000; but it still runs
slow when it reaches 150000... So perhaps it is not a collision problem,
as Xho suggested?

That would be my guess. Again, it seems to be your theory that is
faulty. You assumed that the slowness was caused by collisions. The
facts do not support that theory.
Say I have the key of subhash, as $letter, and the item in subhash as,
$value.

delete $hoh{$letter}{$value};

Your code does not match your description. In the above $letter is a
key of the *main* hash, and $value is a key in the subhash
%{$hoh{$letter}}.
This should delete that from the hash, right?

That would delete the key/value pair in the hash %{$hoh{$letter}} which
has the key $value.
Say I want to look at the value of this key --
$hoh{$letter}{$value} -- how do you print it?
I tried print $hoh{$letter}{$value}; but it prints nothing...

Then that key does not exist in that hash, or that key's value in the
hash is the empty string (or the undefined value). Are you using
warnings? If so, printing a non-existing element of a hash should give
you a warning. Please enable them if you are not. Again, your
description above is not matching your code.
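
For illustration, a hedged sketch of checking before printing (the
variable names are borrowed from your snippets; the sample data is made
up):

use strict;
use warnings;

my %hoh = ( A => { foo => 'bar' } );
my ($letter, $value) = ('A', 'foo');

# test the outer key first: peeking inside $hoh{$letter} would
# autovivify that outer entry if it were absent
if ( exists $hoh{$letter} && exists $hoh{$letter}{$value} ) {
    print "$hoh{$letter}{$value}\n";
} else {
    print "no entry for '$value' under '$letter'\n";
}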

It's time for you to show some *actual* code, so we can better help
you. Please post a short-but-complete script that demonstrates one or
more of the failures you are encountering.

Also, it would be appreciated if you trimmed your replies down to only
include the relevant bits of quoted material. Thank you.

Paul Lalli
 
Paul Lalli

Ben said:
You really need to learn some basic Perl. I'd recommend a book:
'Learning Perl' published by O'Reilly is universally recommended as a
good place to start.
$hoh{$letter} contains a hash
*reference*: see perldoc perldsc and perldoc perlreftut for how
multi-level data structures are implemented in Perl. Or, again, a decent
book will cover it.

Yet, ironically, not the book you recommended. :-( Once the OP has
finished with "Learning Perl", he should probably move on to
"Intermediate Perl", which does cover multi-level data structures.

Paul Lalli
 
Ben Morrow

Quoth tak:
Why 4 collisions? Do you mean 32 - 23 = 9?? Or because you knew that I have
27 subhashes, so 27 - 23??

The latter. Say you had a hash with 5 entries, and scalar %hash printed
2/4 (it wouldn't actually, of course, perl would use 5 buckets, but say
it did). Then you have four buckets, and two of them contain your five
entries, perhaps like

| 5 |   |   |   |
| 3 | 4 |   |   |
| 1 | 2 |   |   |
+---+---+---+---+

So on average perl has to search through 5/2 = 2.5 entries once it has
located the right bucket (assuming the hashing function is fair, which
in general perl's is). Perl will tell you this if you ask nicely:

print( (scalar keys %hash) / (scalar %hash) );

This will give a warning about converting '23/32' to a number (you are
using warnings, right?), but you can ignore that if it's just for
debugging.
Without using the 27 hash of hashes, print
scalar %mainHash reports 16k/23k. (Of course, it reported the exact
numbers, but I didn't write them down; I just remember 16k and 23k.) That
is a huge amount of collision.

This is your hash with 240k entries in it? That's still only an average
of 15 elements per bucket. Still, you could try preallocating and see
if it makes a difference.
I tried to do this: keys %main_hash = 300000;

There's little point in creating more buckets than you will have
elements. That's just a waste of space. Note that Perl allows you to
write '300000' as '300_000', which is much more readable.
but it still runs slow when it reaches 150000... So perhaps it is not
a collision problem, as Xho suggested?

Indeed.


Say I have the key of subhash, as $letter, and the item in subhash as,
$value.

delete $hoh{$letter}{$value};

This should delete that from the hash, right?
Say I want to look at the value of this key --
$hoh{$letter}{$value} -- how do you print it?
I tried print $hoh{$letter}{$value}; but it prints nothing...

Both those examples are correct, as far as I can tell. I think you have
some confusion somewhere else (and I'm still a little worried by your use
of '$value' to refer to a key: are you confusing keys and values?). Can
you show us a short complete script, with (a short set of) input data,
that doesn't do what you expect? ('Short' in this case probably means
<15 lines.)

Ben
 
Ben Morrow

Quoth Paul Lalli:
Yet, ironically, not the book you recommended. :-(

Ah, my apologies. I've never actually read either the Llama or the
Alpaca; I learned Perl from the Camel, here, and perl.plover.com :) .

Ben
 
tak

Without using the 27 hash of hashes, print scalar %mainHash reports
16k/23k... That is a huge amount of collision.

So it would seem.

Sorry, I had a typo: I meant 160k and 230k, not 16k and 23k.

It's time for you to show some *actual* code, so we can better help
you. Please post a short-but-complete script that demonstrates one or
more of the failures you are encountering.

I will put one up in 10 minutes.
Also, it would be appreciated if you trimmed your replies down to only
include the relevant bits of quoted material. Thank you.

Ok...Sorry.
 
tak

How many things were in the hash at the time you did the
print scalar %mainHash? (print scalar keys %mainHash). If it had 150K
things in it at the time, then there are 150K/16K, or about 10 entries per
bucket. Higher than I would expect, but not awful.

At the time I printed out scalar %mainHash, I had 240k entries in it.
(This is the one-big-hash implementation.) And print scalar %mainHash
reports 160k/230k -- not 16k and 23k... I missed a 0... So that means
about 1 entry per slot -- not too much collision?

I printed out the size and scalar value of each of the subhashes
AFTER all 240k entries were inserted. This is a cut and paste:

238348 lines in total..
Item: @ - Size: 9 - ScalarSize: 7/16
Item: A - Size: 12750 - ScalarSize: 8859/16384
Item: B - Size: 9309 - ScalarSize: 7084/16384
Item: C - Size: 14898 - ScalarSize: 9854/16384
Item: D - Size: 8878 - ScalarSize: 6865/16384
Item: E - Size: 7341 - ScalarSize: 4904/8192
Item: F - Size: 7557 - ScalarSize: 4954/8192
Item: G - Size: 7109 - ScalarSize: 4733/8192
Item: H - Size: 7399 - ScalarSize: 4854/8192
Item: I - Size: 12290 - ScalarSize: 8639/16384
Item: J - Size: 3898 - ScalarSize: 2516/4096
Item: K - Size: 4596 - ScalarSize: 3494/8192
Item: L - Size: 8549 - ScalarSize: 6649/16384
Item: M - Size: 10149 - ScalarSize: 7598/16384
Item: N - Size: 7725 - ScalarSize: 5044/8192
Item: O - Size: 8791 - ScalarSize: 6818/16384
Item: P - Size: 11573 - ScalarSize: 8262/16384
Item: Q - Size: 12397 - ScalarSize: 8721/16384
Item: R - Size: 8247 - ScalarSize: 6464/16384
Item: S - Size: 13921 - ScalarSize: 9348/16384
Item: T - Size: 8995 - ScalarSize: 6884/16384
Item: U - Size: 10667 - ScalarSize: 7799/16384
Item: V - Size: 9166 - ScalarSize: 7058/16384
Item: W - Size: 11293 - ScalarSize: 8142/16384
Item: X - Size: 6617 - ScalarSize: 4555/8192
Item: Y - Size: 8704 - ScalarSize: 6717/16384
Item: Z - Size: 4167 - ScalarSize: 3277/8192
scalar of master_hash is: 27/524288

This data is being printed by the following code,

for my $item ( sort keys %mainHash ) {
    my $size       = keys %{ $mainHash{$item} };
    my $scalarSize = scalar %{ $mainHash{$item} };
    print "Item: $item - Size: $size - ScalarSize: $scalarSize\n";
}
print "scalar of master_hash is: ";
print scalar %mainHash;
print "\n";

Looking at the ScalarSize and my PF usage, it does seem like a memory
issue: loading more data than the system can handle? In situations like
that, does it matter whether I do keys(%mainHash) = 300000 or not?


Are you using warnings? If so, did you get an uninitialized value
warning?

Perhaps not. How do you turn on warnings?

Thanks,
Tak
 
Paul Lalli

Ben said:
Ah, my apologies. I've never actually read either the Llama or the
Alpaca; I learned Perl from the Camel, here, and perl.plover.com :) .

They're both great books, but the Llama is, unfortunately, *very*
beginner-oriented. It does not cover references, multi-dimensional
structures, or even passing arrays or hashes to subroutines (because,
obviously, that involves references).

Paul Lalli
 
Ben Morrow

Quoth tak:
At the time I printed out scalar %mainHash, I had 240k entries in it.
(This is the one-big-hash implementation.) And print scalar %mainHash
reports 160k/230k -- not 16k and 23k... I missed a 0... So that means
about 1 entry per slot -- not too much collision?

That's more like what I would have expected (with no basis whatever
other than faith in perl's hashing algorithm :)). Exactly one entry per
slot is perfect hashing: no searching at all within the bucket. Not much
more than that means not much searching.
Looking at the ScalarSize and my PF usage, it does seem like a memory
issue: loading more data than the system can handle? In situations like
that, does it matter whether I do keys(%mainHash) = 300000 or not?

Well... preallocating too much will only make things worse.
Preallocating at all is a dodgy business (you have to calculate how much
to preallocate, which makes the code more complicated, more fragile,
and less readable), so don't do it if bucket allocation isn't your
problem, which it seems not to be.
Perhaps not - how do you turn on warnings?

Have you read the Posting Guidelines? They cover some basic stuff like
this. At the top of every script you write you should have

use strict;
use warnings;

which turn on a whole lot of diagnostics which are usually essential but
often annoying for one-liners.

Ben
 
xhoster

tak said:
At the time I printed out scalar %mainHash, I had 240k entries in it.
(This is the one-big-hash implementation.) And print scalar %mainHash
reports 160k/230k -- not 16k and 23k... I missed a 0... So that means
about 1 entry per slot -- not too much collision?

Right, collisions are not your problem.
I printed out the size and scalar value of each of the subhashes
AFTER all 240k entries were inserted. This is a cut and paste:

238348 lines in total..
Item: @ - Size: 9 - ScalarSize: 7/16
Item: A - Size: 12750 - ScalarSize: 8859/16384
[...]
Item: Z - Size: 4167 - ScalarSize: 3277/8192
scalar of master_hash is: 27/524288

So it looks like you preallocated the master hash to have 300_000 buckets
(which got rounded up to the next power of two, 524288). There is
certainly no benefit in doing that, as the master hash only holds 27
things. Of course this is irrelevant, as we now know there is no reason to
use the two-level hash in the first place!

Looking at the ScalarSize and my PF usage, it does seem like a memory
issue: loading more data than the system can handle?

Yes. Either loading too much data, or storing too much, or storing it in
an inefficient way.
In situations like that, does it matter whether I do keys(%mainHash) = 300000
or not?

It can only make things worse. You are creating hundreds of thousands of
buckets that take up some amount of memory but which will never be used.
If you revert to the single-hash implementation, then pre-allocating that
giant hash will not hurt. It might help, but probably not by a meaningful
amount. Preallocating hashes (or arrays) almost *never* helps in a
meaningful way. I've only seen it help twice in non-toy cases, and both of
those were on older versions of Perl: one had a bad hashing function that
degenerated when given keys which were all multiples of 3, and the other
was a bug/misfeature in the garbage collector.
Perhaps not - how do you turn on warnings?

The preferred way is to "use warnings;". On older Perls, you specify
-w on the command line or the shebang line.

#!/usr/bin/perl -w

Xho
 
anno4000

[...]
Say I have the key of subhash, as $letter, and the item in subhash as,
$value.

delete $hoh{$letter}{$value};

This should delete that from the hash, right?

Yes.
...

I find the question too vague to be answered unconditionally.

It deletes the entry associated with the key $value in the anonymous
hash %{$hoh{$letter}}. It does not delete anything from %hoh itself, even
if the entry for $value was the last one in %{$hoh{$letter}}. That may or
may not matter for the program, but in another part of the thread the OP
seemed to expect the outer entry to go away. It doesn't, without
help from the programmer.
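
A sketch of that "help from the programmer" (assuming, as discussed
above, that the second subscript names the inner key; the sample data
is made up):

my %hoh    = ( A => { ABC => 'apple' } );
my $letter = 'A';
my $key    = 'ABC';

delete $hoh{$letter}{$key};                       # removes only the inner entry
delete $hoh{$letter} unless %{ $hoh{$letter} };   # prune the now-empty subhash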

Anno
 
anno4000

Ben Morrow said:
So, you are thrashing. You've run out of memory: I would suggest using
one of the DBM modules, probably DB_File. This stores the contents of
the hash in a (structured, binary, fast-to-index) file on disk, which
will probably make things faster.

But should it grow to almost two GB in the first place? We have no
idea how long the OP's lines are, but assuming something ordinary like
80 bytes, that's a bit over 18 MB. Even Perl shouldn't blow it up
that much. Or is some other memory hog running? It sounds like the
program isn't doing much except reading the data. That shouldn't
bring a two-gig machine down, unless the lines are much longer
(about 8 KB), of course.

Anno
 
Ilya Zakharevich

anno4000 said:
But should it grow to almost two GB in the first place? We have no
idea how long the OP's lines are, but assuming something ordinary like
80 bytes, that's a bit over 18 MB. Even Perl shouldn't blow it up
that much. Or is some other memory hog running? It sounds like the
program isn't doing much except reading the data. That shouldn't
bring a two-gig machine down, unless the lines are much longer
(about 8 KB), of course.

Note that hashes are quite memory-hungry (I would say circa 80 bytes of
total overhead per key, with a very good malloc() implementation).
So with a VERY LOUSY malloc() implementation (e.g., one with a 4K
minimum allocation unit), it could allocate two 4K chunks per key:
one for the key, another for the value. That is exactly the 8 KB of your
calculation (and 240k keys at 8 KB each is about 1.9 GB, which matches
the observed usage).

And I have actually SEEN such malloc()s widely used: some kinds of
debugging tools put each chunk on a separate page (so that bounds
checking is simplified). Of course, we have no way to know whether that
is what the OP is using...

Hope this helps,
Ilya
 
