perl hash: low-level implementation details?


zikester

I've written a perl program that takes 3GB worth of key/value pairs
( each are numbers in the range of 0-60million ), and builds a hash
with them. The hash itself seems to be taking more than 130GB ( linux
64-bit ) and counting--I had to kill the program b/c it was growing
too large ( we have some 130Gb memory machines at our lab ).

Is there an article/book that describes the inner workings of perl
data structures / hashes in particular? I just want to know why it's
taking so much memory.

Thanks

Isaac
 

Jürgen Exner

zikester said:
I've written a perl program that takes 3GB worth of key/value pairs
( each are numbers in the range of 0-60million ), and builds a hash
with them. The hash itself seems to be taking more than 130GB ( linux
64-bit ) and counting--I had to kill the program b/c it was growing
too large ( we have some 130Gb memory machines at our lab ).

Is there an article/book that describes the inner workings of perl
data structures / hashes in particular? I just want to know why it's
taking so much memory.

One wild guess: the keys in a hash are always strings. I.e. even if you
use numbers Perl will use the string value of that number, not the
numerical value itself. Depending on the minimum size of a string in
perl this could add up quickly. How many key-value pairs are we talking
about?
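
For example, a numeric key is stringified on the way in:

my %h;
$h{3.0} = 'x';                          # the numeric key 3.0 ...
print exists $h{'3'} ? "same key\n"     # ... is stored as the string "3"
                     : "different key\n";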

Implementing a sparse array may be a better choice (is there a module
for that?).

jue
 

zikester

One wild guess: the keys in a hash are always strings. I.e. even if you
use numbers Perl will use the string value of that number, not the
numerical value itself. Depending on the minimum size of a string in
perl this could add up quickly. How many key-value pairs are we talking
about?

Implementing a sparse array may be a better choice (is there a module
for that?).

jue

Good point--we are talking about 510 million pairs of data. I
actually did reimplement this in C++ using a sparse array of vectors
and it ran in about 15Gb.

And good guess as to the hash value always being strings: is there
any documentation on the low-level workings on these types of topics?
I'd love to get more in-depth knowledge of what Perl is actually doing
at the "C" level.
 

Dan Rumney

Jürgen Exner said:
One wild guess: the keys in a hash are always strings. I.e. even if you
use numbers Perl will use the string value of that number, not the
numerical value itself. Depending on the minimum size of a string in
perl this could add up quickly. How many key-value pairs are we talking
about?

Implementing a sparse array may be a better choice (is there a module
for that?).

All numbers from 0 to 60M as strings are only going to add up to a few
tens of MB.

Perhaps there's a bug in your script. Can you provide the source code?
Or, at the very least, a simplified version that still results in the
same behaviour.

Dan
 

Uri Guttman

z> Good point--we are talking about 510 million pairs of data. I
z> actually did reimplement this in C++ using a sparse array of vectors
z> and it ran in about 15Gb.

z> And good guess as to the hash value always being strings: is there
z> any documentation on the low-level workings on these types of topics?
z> I'd love to get more in-depth knowledge of what Perl is actually doing
z> at the "C" level.

hashes take up space for the key in string form and an SV for the
value and other overhead. you can check out the space usage with the
Devel::Size module.

since your numbers are all under 60M you can save some space by using a
packed 4 byte integer for the key instead of the stringified number
which can be up to 8 bytes long. this may not actually save any space
since the allocated string buffer will likely hold 8 bytes anyhow but it
is worth trying out. easy to check this out with Devel::Size on a small
data set.
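
for example, a quick (untested) sketch to compare the two on a small
sample with Devel::Size:

use Devel::Size qw(total_size);

my ( %string_keys, %packed_keys );
for ( 1 .. 100_000 ) {
    my $n = int rand 60_000_000;
    $string_keys{$n}             = 1;   # key held as its decimal string
    $packed_keys{ pack 'N', $n } = 1;   # key held as a fixed 4-byte string
}
printf "string keys: %d bytes\npacked keys: %d bytes\n",
    total_size( \%string_keys ), total_size( \%packed_keys );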

and it would be very easy to make your own sparse matrix (i found
Math::MatrixSparse on cpan but i don't know how efficient it is and it is
version .01). just write a simple hash for your numbers and index (mod
the array size) into a large array. handle collisions with one of
several easy algorithms (rehash until no collision, sequential scan for
next empty slot, or push to an array for that slot). that should lower
your storage needs by quite a bit as arrays use less ram than hashes.

if you are really tightfisted about space, you can even implement that
whole thing in a single (very large) scalar string. that will have
almost no perl overhead but you will need to use substr to index into
the slots (of your making) and deal with collisions and buckets (that
holds the actual value of a key). this isn't too hard but could be worth
it if you need the space.
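
a rough, untested sketch of that idea, with fixed 8-byte slots (4-byte
key plus 4-byte value), one value per key, and linear probing for
collisions:

my $slots = 100_000_000;              # pick comfortably above the pair count
my $table = "\0" x ( $slots * 8 );    # ~800 MB, allocated once up front

sub store_pair {
    my ( $key, $value ) = @_;
    my $i = $key % $slots;
    while (1) {
        my ($k) = unpack 'N', substr( $table, $i * 8, 4 );
        if ( $k == 0 or $k == $key + 1 ) {    # empty slot, or the same key again
            # store key+1 so that key 0 cannot be mistaken for an empty slot
            substr( $table, $i * 8, 8 ) = pack 'NN', $key + 1, $value;
            return;
        }
        $i = ( $i + 1 ) % $slots;             # linear probe to the next slot
    }
}

sub fetch_pair {
    my ($key) = @_;
    my $i = $key % $slots;
    while (1) {
        my ( $k, $v ) = unpack 'NN', substr( $table, $i * 8, 8 );
        return undef if $k == 0;              # empty slot: key not present
        return $v    if $k == $key + 1;
        $i = ( $i + 1 ) % $slots;
    }
}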

finally, why aren't you using a disk based hash like some dbm flavor?
many exist and are pretty fast and do caching for speed. then the code
is just a tied hash and you do no extra work. slower for sure than ram but
if you have lots of ram, it can do the job in decent time.
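
e.g. with DB_File, just one of the dbm flavors (a minimal sketch; a
plain dbm stores flat strings, so multiple values per key would need
packing):

use DB_File;
use Fcntl;      # for O_CREAT and O_RDWR

# keys and values live in the file (and the OS page cache), not in perl's heap
tie my %hash, 'DB_File', 'pairs.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "cannot tie pairs.db: $!";

$hash{12345} = 67890;
print "$hash{12345}\n";
untie %hash;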

uri
 

zikester

  >> >I've written a perl program that takes 3GB worth of key/value pairs
  >> >( each are numbers in the range of 0-60million ), and builds a hash
  >> >with them.  The hash itself seems to be taking more than 130GB ( linux
  >> >64-bit ) and counting--I had to kill the program b/c it was growing
  >> >too large ( we have some 130Gb memory machines at our lab ).
  >>
  >> >Is there an article/book that describes the inner workings of perl
  >> >data structures / hashes in particular?   I just want to know why it's
  >> >taking so much memory.
  >>
  >> One wild guess: the keys in a hash are always strings. I.e. even if you
  >> use numbers Perl will use the string value of that number, not the
  >> numerical value itself. Depending on the minimum size of a string in
  >> perl this could add up quickly. How many key-value pairs are we talking
  >> about?
  >>
  >> Implementing a sparse array may be a better choice (is there a module
  >> for that?).
  >>
  >> jue

  z> Good point--we are talking about 510 million pairs of data.  I
  z> actually did reimplement this in C++ using a sparse array of vectors
  z> and it ran in about 15Gb.

  z> And good guess as to the hash value always being strings:  is there
  z> any documentation on the low-level workings on these types of topics?
  z> I'd love to get more in-depth knowledge of what Perl is actually doing
  z> at the "C" level.

hashes take up space for the key in string form and an SV for the
value and other overhead. you can check out the space usage with the
Devel::Size module.

since your numbers are all under 60M you can save some space by using a
packed 4 byte integer for the key instead of the stringified number
which can be up to 8 bytes long. this may not actually save any space
since the allocated string buffer will likely hold 8 bytes anyhow but it
is worth trying out. easy to check this out with Devel::Size on a small
data set.

and it would be very easy to make your own sparse matrix (i found
Math::MatrixSparse on cpan but i don't know how efficient it is and it is
version .01). just write a simple hash for your numbers and index (mod
the array size) into a large array. handle collisions with one of
several easy algorithms (rehash until no collision, sequential scan for
next empty slot, or push to an array for that slot). that should lower
your storage needs by quite a bit as arrays use less ram than hashes.

if you are really tightfisted about space, you can even implement that
whole thing in a single (very large) scalar string. that will have
almost no perl overhead but you will need to use substr to index into
the slots (of your making) and deal with collisions and buckets (that
holds the actual value of a key). this isn't too hard but could be worth
it if you need the space.

finally, why aren't you using a disk based hash like some dbm flavor?
many exist and are pretty fast and do caching for speed. then the code
is just a tied hash and you do no extra work. slower for sure than ram but
if you have lots of ram, it can do the job in decent time.

uri

Thanks for the alternate approaches: I realize the impracticality of
my approach, but I just want to understand what is going on :)


The following simplified code generates 16GB of RAM usage for me.
Since we have 50 million pairs, a hash with 50 million slots works out
to about 320 bytes of memory usage per slot. Each slot on average
would have close to 1 member in it, so I don't think the string
explanation accounts for all of the memory usage. I've heard strings
have a base overhead of a few dozen bytes ( i.e., a string of length 1
would still cost about 40 bytes ). If that is so, then between the
keys and values both being strings ( 80 bytes ), the hash allocating
more slots than it needs so as to avoid collisions and keep the
density around 50% ( a factor of 2 = 160 bytes ), and the linked lists
possibly being implemented as vectors which themselves over-allocate
by a factor of 2 to reduce the need for expensive mallocs()
( = 320 bytes ), maybe the usage isn't so far-fetched.

I will look into Devel::Size right now.

===============================================
use strict;

# simulate generating pairs

my %hash = ();
$|=1;

my $maxId = 60000000;
my $numPairs = 50000000;
for ( my $i=0; $i < $numPairs; $i++ )
{
    if ( $i % 10000 == 0 ) { print "." };
    my $key = rand() * $maxId;
    my $value = $key;

    if ( exists( $hash{ $key } ) )
    {
        push( @{$hash{ $key } }, $value );
    }
    else
    {
        $hash{ $key } = [ $value ];
    }
}
 

Jürgen Exner

zikester said:
Good point--we are talking about 510 million pairs of data. I
actually did reimplement this in C++ using a sparse array of vectors
and it ran in about 15Gb.

And good guess as to the hash value always being strings:

NO!!!! The key is a string, the value can be any scalar!
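
For example:

my %h;
$h{42} = [ 1, 2, 3 ];       # the value can be any scalar, here an array reference
print ref $h{42}, "\n";     # prints "ARRAY"; the key 42 is still stored as "42"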

jue
 

Uri Guttman

z> ===============================================
z> use strict;

add use warnings!

z> # simulate generating pairs

z> my %hash = ();

no need to initialize a hash to the empty list. my does that for you.

z> $|=1;

z> my $maxId = 60000000;
z> my $numPairs = 50000000;
z> for ( my $i=0; $i < $numPairs; $i++ )

for my $i ( 1 .. $numPairs ) {

recent perls will optimize that to a loop and not build the full list.


z> {
z> if ( $i % 10000 == 0 ) { print "." };
z> my $key = rand() * $maxId;

that is more commonly done as rand( $maxId ). it generates the same
results.

z> my $value = $key;

z> if ( exists( $hash{ $key } ) )
z> {
z> push( @{$hash{ $key } }, $value );
z> }
z> else
z> {
z> $hash{ $key } = [ $value ];
z> }

bah! you don't know about autovivification! the push line on its own
will work fine. perl will make an anon array for you when it sees an
undef used as a reference. i wrote this years ago and post it here on a
semiregular basis. go read it!

http://sysarch.com/Perl/autoviv.txt

that whole bottom part could just be:

my $key = rand( $maxId );
push( @{ $hash{$key} }, $key );

uri
 

Xho Jingleheimerschmidt

zikester said:
....

Good point--we are talking about 510 million pairs of data.

3GB / 510 million pairs / 2 numbers per pair = less than 3 bytes per number.

You can't store up to 60 million using only 3 bytes.

Also, you can't have 510 million distinct keys in the range of 0-60
million, unless you are talking floats. How are you dealing with
duplicated keys?



I actually did reimplement this in C++ using a sparse array of vectors
and it ran in about 15Gb.

That is what Perl does--throw memory at every possible problem.

And good guess as to the hash value always being strings: is there
any documentation on the low-level workings on these types of topics?

I don't think there is. The philosophy seems to be:

1) If you have to ask, Perl might not be the right tool for this
particular problem.

2) Use the source, Luke.
I'd love to get more in-depth knowledge of what Perl is actually doing
at the "C" level.

Perl is open source. You can't get much closer to the C level than the
actual C.

Xho
 

Ilya Zakharevich

I've written a perl program that takes 3GB worth of key/value pairs
( each are numbers in the range of 0-60million ), and builds a hash
with them. The hash itself seems to be taking more than 130GB ( linux
64-bit ) and counting--I had to kill the program b/c it was growing
too large ( we have some 130Gb memory machines at our lab ).

Is there an article/book that describes the inner workings of perl
data structures / hashes in particular? I just want to know why it's
taking so much memory.

I suspect your Perl may be broken. The typical overhead is below 100B
per key (on 32-bit machines) + extra malloc overhead. So 5e8 keys
should not take more than 500M * 200B = 100GB overhead on a 64-bit
machine.

Did you try to use "my" malloc when you configured Perl?

Ilya
 

Owen

  z> ===============================================
  z> use strict;

add use warnings!

  z> # simulate generating pairs

  z> my %hash = ();

no need to initialize a hash to the empty list. my does that for you.

  z> $|=1;

  z> my $maxId = 60000000;
  z> my $numPairs = 50000000;
  z> for ( my $i=0; $i < $numPairs; $i++ )

for my $i ( 1 .. $numPairs ) {

recent perls will optimize that to a loop and not build the full list.

  z> {
  z>     if ( $i % 10000 == 0 ) { print "." };
  z>     my $key = rand() * $maxId;

that is more commonly done as rand( $maxId ). it generates the same
results.

  z>     my $value = $key;

  z>     if ( exists( $hash{ $key } ) )
  z>     {
  z>         push( @{$hash{ $key } }, $value );
  z>     }
  z>     else
  z>     {
  z>         $hash{ $key } = [ $value ];
  z>     }

bah! you don't know about autovivification! the push line on its own
will work fine. perl will make an anon array for you when it sees an
undef used as a reference. i wrote this years ago and post it here on a
semiregular basis. go read it!

       http://sysarch.com/Perl/autoviv.txt

that whole bottom part could just be:

        my $key = rand( $maxId );
        push( @{ $hash{$key} }, $key );

uri



So I tried all that, with $numPairs = 500000, and watching htop, it
took 100 MB of memory. With the number increased tenfold to 5000000,
memory usage went to 1000 MB. My conclusion: 200 MB per million pairs,
or 100 GB for 500 million pairs.


#!/usr/bin/perl

use strict;
use warnings;

my %hash;
$| = 1;

my $maxId = 60000000;
my $numPairs = 500000;

for my $i ( 1 .. $numPairs ) {
    my $key = rand($maxId);
    push( @{ $hash{$key} }, $key );
}





Owen
 

Xho Jingleheimerschmidt

zikester said:
Thanks for the alternate approaches: I realize the impracticality of
my approach, but I just want to understand what is going on :)


The following simplified code generates 16Gb of RAM usage for me.
Since we have 50 million pairs, given a hash of size 50 million slots
we get about 320 bytes of memory usage per slot. Each slot on average
would have close to 1 member in it,

Each value slot will have exactly one value in it--that is how Perl
hashes work. However, in your code that value will be a reference to an
array, which will on average have close to one element in it.

And there goes your memory. You have about 50 million tiny arrays, each
one using a lot of overhead.

I'd guess roughly it comes up to something like: 48 bytes for the key
and associated structure, 40 bytes for the value-scalar (which holds an
arrayref), 160 bytes for the array overhead, and 48 bytes for each
scalar (usually 1) inside each array.

so I don't think the string
explanation accounts for all of the memusage. I've heard strings
have a base overhead of a few dozen bytes ( i.e., a string of length 1
would still cost about 40 bytes ).

On a 64-bit machine, I think this is pretty close.

If this is so, then perhaps the
fact that the keys and values are both strings ( 80 bytes ), the fact
that the hash may allocate more slots than it needs so as to avoid
collisions to keep the density around 50% ( x factor of 2 =160
bytes ),

But it doesn't work like that. I think an unused hash slot is only the
size of a pointer. The slot itself doesn't have the storage for the key
or the value (and if it did, how would it know the size of your future
key and/or value?)

Xho
 

Peter Scott

I've written a perl program that takes 3GB worth of key/value pairs (
each are numbers in the range of 0-60million ), and builds a hash with
them. The hash itself seems to be taking more than 130GB ( linux 64-bit
) and counting--I had to kill the program b/c it was growing too large (
we have some 130Gb memory machines at our lab ).

Is there an article/book that describes the inner workings of perl data
structures / hashes in particular? I just want to know why it's taking
so much memory.

To find out more about Perl's internal data representations, google
"perlguts illustrated".

To store a lot of numbers compactly, consider the PDL module (http://
pdl.perl.org/).
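
For single-valued numeric data a PDL-based version might look roughly like
this (assuming one value per key and keys densely packed into 0..60 million):

use PDL;

# one 4-byte integer per possible key, about 240 MB in total
my $values = zeroes( long, 60_000_000 );
$values->set( 12_345, 67_890 );       # store value 67890 under key 12345
print $values->at( 12_345 ), "\n";    # fetch it back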
 

Ted Zlatanov

XJ> Each value slot will have exactly one value in it--that is how Perl
XJ> hashes work. However, in you code that value will be a reference to
XJ> an array, which array will on average have close to 1 element in it.

XJ> And there goes your memory. You have about 50 million tiny arrays,
XJ> each one using a lot of overhead.

XJ> I'd guess roughly it comes up to something like: 48 bytes for the key
XJ> and associated structure, 40 bytes for the value-scalar (which holds
XJ> an arrayref), 160 bytes for the array overhead, and 48 bytes for each
XJ> scalar (usually 1) inside each array.

In fact my #1 recommendation after the above discussion would be to make
your values strings instead of arrays, joined on some fixed delimiter
(e.g. a NUL byte) or assembled with pack/unpack. If the values can be packed more
efficiently in binary (e.g. they are always floats), do that in the
value.
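
For the numeric values in this thread that could look something like this
(sketch):

# instead of   push @{ $hash{$key} }, $value;   (one tiny array per key)
$hash{$key} .= pack 'N', $value;          # append one 4-byte integer per value

my @values = unpack 'N*', $hash{$key};    # recover every value for that key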

I would use SQLite to do this, however. With an indexed primary key,
the lookup will be very fast and, more importantly, can be parallelized.
Running more than one core is very valuable with this kind of large
problem. You can even make a RAM disk and keep the database file there.
It may be slightly slower than a pure Perl solution, but I would
benchmark it regardless.
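
A rough sketch of the SQLite route with DBI/DBD::SQLite (using a plain
index rather than a primary key, since keys can repeat; the @pairs list
here is just a stand-in for however the pairs are actually generated):

use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=pairs.db', '', '',
    { RaiseError => 1, AutoCommit => 0 } );
$dbh->do('CREATE TABLE IF NOT EXISTS pairs (k INTEGER, v INTEGER)');
$dbh->do('CREATE INDEX IF NOT EXISTS k_idx ON pairs (k)');

my $ins = $dbh->prepare('INSERT INTO pairs (k, v) VALUES (?, ?)');
$ins->execute( @$_ ) for @pairs;    # @pairs holds [ $key, $value ] pairs
$dbh->commit;

my $values = $dbh->selectcol_arrayref(
    'SELECT v FROM pairs WHERE k = ?', undef, 12345 );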

There are also dedicated key-value databases like TDB and many many
others that are designed for large data sets and scale well across
multiple machines. What you should use depends on the end purpose of
your key-value store.

Ted
 

Ilya Zakharevich

I'd guess roughly it comes up to something like: 48 bytes for the key
and associated structure, 40 bytes for the value-scalar (which holds an
arrayref), 160 bytes for the array overhead, and 48 bytes for each
scalar (usually 1) inside each array.

??? Where do you get these numbers? On a 32-bit machine, it is 16 bytes
per scalar (more for strings), and 40 bytes + 4*num_elts for the array
envelope.

Ilya
 

Ilya Zakharevich

I suspect your Perl may be broken. The typical overhead is below 100B
per key (on 32-bit machines) + extra malloc overhead. So 5e8 keys
should not take more than 500M * 200B = 100GB overhead on a 64-bit
machine.

Hmm, from other postings I see that you use a hash of arrays. Then
130GB does not look excessive...
Did you try to use "my" malloc when you configured Perl?

.... Which makes *this* suggestion less appealing... On the other
hand, if your optimized version is close to the top limit of memory,
using "my" malloc may help.

Yours,
Ilya
 

Xho Jingleheimerschmidt

Ilya said:
Probably "a faulty memory of past experiments"...

I've since repeated some experiments on a 64-bit machine. I greatly
underestimated the size of the hash keys (and associated overhead).
They take a minimum of 88 bytes, and obviously more if they are long
(but less if they are shared with other hashes). My original estimate
of the array overhead included the scalars to hold the array references
(and a single array to hold those scalars), as you can't have millions
of arrays without some way to hold them. So I was double counting that
part.

The values for small arrays are quite wobbly, because they are path
dependent. For example, pushing onto an autoviv takes more room than
explicitly assigning an arrayref, that is
push @{$x{$_}}, 1, 2;
versus
$x{$_}=[1,2];

Also, adding the first element to a previously empty array takes a lot
more space than adding additional elements, so unless you plan on having
lots of references to empty arrays, you would want to benchmark the size
of a single-element array and then subtract the size of the element (or
create the array as a single element, then delete that element, inside
the loop), rather than measuring virginal arrays.
.... so multiplying by 2 gives an upper bound...

Maybe, maybe not. I get better results with the experimental method
rather than trying to do it all from first principles, and inevitably
overlooking things.

Xho
 

Xho Jingleheimerschmidt

David said:
On Mon, 23 Nov 2009 21:23:14 -0800 in comp.lang.perl.misc, Xho Jingleheimerschmidt wrote:


I'm thinking a good way to store it would be the actual value in the
hash as long as there is only one for that key, or an array reference as
soon as it grows to more than one.

Been there, done that. But I wouldn't recommend it. If every store and
fetch has to have special case code anyway, and as long as you can pick
a character guaranteed not to be in the values, I'd just join the
multiple values on the magic character on storage and split on that
character on retrieval. That way you get the space savings even if many
or most values end up being multivalued.
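
That is, something along these lines (sketch, using a NUL byte as the magic
character):

# store: append, separated by a character assumed never to occur in a value
$hash{$key} = defined $hash{$key} ? "$hash{$key}\0$value" : $value;

# fetch: always get a list back, even when there is only one value
my @values = split /\0/, $hash{$key};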

I guess it depends on what assumption is least likely to be violated in
the future, the one that the vast majority of entries will remain
single-valued, or the one that your magic character will remain magic.

Or maybe pack and unpack would be better than join/concat and split,
depending on the nature of the values.

You would need a sub to insert and
to access, to keep complexity under control. At which point maybe it
makes sense to make it a class module. Am I going down a bad path here?

I'm not sure what you mean by a class module. An OO module? I think
that if you need to optimize for space today, then there is a good
chance you will need to optimize for speed tomorrow. So I'd be somewhat
reluctant to start routing all of the accesses (probably in the
innermost loop) through OO calls. But I guess it would depend on how
many different places in the code needed to access this, and on whether
I already knew that something else was going to be an irreducible bottleneck.

Xho
 
