Large Dictionaries


Chris Foote

Hi all.

I need to store a large number (10M) of keys in a hash table,
keyed by a tuple of (long_integer, integer). The standard Python
dictionary works well for small numbers of keys, but starts to
perform badly for me when inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M              8.8s      22.2s
2M             24.0s      43.7s
5M            115.3s     105.4s

Has anyone written a fast hash module which is better suited to
large datasets?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris
 

Richie Hindle

[Chris]
Has anyone written a fast hash module which is better suited to
large datasets?

PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."
 

Roy Smith

Chris Foote <[email protected]> said:
Hi all.

I need to store a large number (10M) of keys in a hash table,
keyed by a tuple of (long_integer, integer). The standard Python
dictionary works well for small numbers of keys, but starts to
perform badly for me when inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M              8.8s      22.2s
2M             24.0s      43.7s
5M            115.3s     105.4s

Are those clock times or CPU times?

How much memory is your process using and how much is available on your
machine?

I'm guessing an integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits. Plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.

Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values, there's probably 4 bytes for a null pointer
(or a reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg. I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.
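
For what it's worth, the arithmetic behind those guesses as a quick sketch
(all the per-object sizes below are the same guesses as above, not measured
figures):

n = 5 * 1000 * 1000          # number of keys
key_bytes = 20               # guessed size of one (long_integer, integer) tuple
slot_bytes = 4               # guessed size of one hash-table slot (a pointer)
slots_per_key = 2            # table kept roughly half full

keys_mb = n * key_bytes / 1e6
table_mb = n * slots_per_key * slot_bytes / 1e6    # key pointers in the table
values_mb = n * slots_per_key * slot_bytes / 1e6   # value pointers (all None)

print("keys   ~ %d MB" % keys_mb)                            # ~100
print("table  ~ %d MB" % table_mb)                           # ~40
print("values ~ %d MB" % values_mb)                          # ~40
print("total  ~ %d MB" % (keys_mb + table_mb + values_mb))   # ~180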
 

Claudio Grondi

Chris said:
Hi all.

I need to store a large number (10M) of keys in a hash table,
keyed by a tuple of (long_integer, integer). The standard Python
dictionary works well for small numbers of keys, but starts to
perform badly for me when inserting roughly 5M keys:

# keys    dictionary    metakit    (both using psyco)
------    ----------    -------
1M              8.8s      22.2s
2M             24.0s      43.7s
5M            115.3s     105.4s

Has anyone written a fast hash module which is better suited to
large datasets?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.

Cheers,
Chris
Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.
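
A minimal sketch of this approach (untested here; the make_key helper is only
an illustration), using the bundled bsddb module with None as the filename:

import bsddb

# Open a BerkeleyDB hash table with no backing filename ("in-memory"
# according to the docs; as noted later in this thread, it may still be
# backed by a temporary file on disk).
db = bsddb.hashopen(None)

# bsddb keys and values must be strings, so encode the
# (long_integer, integer) tuple as a string key.
def make_key(a, b):
    return "%d:%d" % (a, b)

db[make_key(10**12 + 3, 42)] = ""             # store a key; the value is unused
print(db.has_key(make_key(10**12 + 3, 42)))   # True  -> hit
print(db.has_key(make_key(5, 6)))             # False -> miss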

Claudio
 

Aahz

Roy Smith said:
Are those clock times or CPU times?

And what are these times measuring? Don't forget that adding keys
requires resizing the dict, which is a moderately expensive operation.
Once the dict is constructed, lookup times should be quite good.
 

Roy Smith

Aahz said:
Don't forget that adding keys requires resizing the dict, which is a
moderately expensive operation.

My guess would be that each resize grows the table by a constant
factor, which, IIRC, works out to amortized O(n). It's not as good as
creating the dict the right size in the first place, but it's really
not that bad. Somewhat more troubling is that it can lead to memory
fragmentation problems.
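
A small sketch of why growth by a constant factor stays amortized O(n) (the
initial size, growth factor and fill threshold below are illustrative only,
not CPython's exact resizing policy):

def total_copy_work(n_inserts, initial_size=8, growth_factor=4):
    # Count how many entries get copied during resizes while inserting
    # n_inserts keys into a table that grows by a constant factor
    # whenever it becomes more than 2/3 full.
    size = initial_size
    used = 0
    copied = 0
    for _ in xrange(n_inserts):
        used += 1
        if used * 3 >= size * 2:      # more than 2/3 full -> grow the table
            copied += used            # every existing entry is rehashed/copied
            size *= growth_factor
    return copied

print(total_copy_work(5 * 1000 * 1000))  # ~1.9M copies: O(n), not O(n**2)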

I don't understand why Python doesn't have a way to give a size hint
when creating a dict. It seems so obvious to be able to say "d = dict
(size=10*1000*1000)" if you know beforehand that's how many keys
you're going to add.

Looking at the docs, I'm astounded to discover that you can indeed do
exactly that, but you get {size:10000000}. I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size. I suppose we
could still add:

d = {}
d.reserve (10*1000*1000)
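
For the record, here is what that keyword form actually builds (a quick
sketch):

d = dict(size=10*1000*1000)
print(d)     # {'size': 10000000} -- a one-entry dict, not a pre-sized one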
 

Maric Michaud

On Monday 15 May 2006 at 19:24, Roy Smith wrote:
d = {}
d.reserve (10*1000*1000)

d={}.fromkeys(xrange(5*10**6)) ?

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
 

Paul McGuire

Claudio Grondi said:
Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

"Berkeley DB was 20 times faster than other databases. It has the
operational speed of a main memory database, the startup and shut down
speed of a disk-resident database, and does not have the overhead of
a client-server inter-process communication."
Ray Van Tassle, Senior Staff Engineer, Motorola

Please let me/us know if it is what you are looking for.

Claudio

sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.
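
A minimal sketch of the in-memory variant (written against the
sqlite3/pysqlite DB-API; the table and column names are made up for
illustration):

import sqlite3

conn = sqlite3.connect(":memory:")    # special name: purely in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE keys (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")

cur.execute("INSERT INTO keys VALUES (?, ?)", (123456789012345, 42))
conn.commit()

cur.execute("SELECT 1 FROM keys WHERE a = ? AND b = ?", (123456789012345, 42))
print(cur.fetchone() is not None)     # True -> hit; None would mean a miss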

-- Paul
 

Diez B. Roggisch

Maric said:
On Monday 15 May 2006 at 19:24, Roy Smith wrote:
d = {}
d.reserve (10*1000*1000)

d={}.fromkeys(xrange(5*10**6)) ?

That is a totally different beast. You don't want to insert arbitrary
keys, you want the internal hash-array to be of the right size.
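
A quick illustration of the difference (sketch): fromkeys really inserts
those integers as keys, it doesn't merely reserve space for the keys you
actually want:

d = {}.fromkeys(xrange(5))
print(d)    # {0: None, 1: None, 2: None, 3: None, 4: None}
# The real (long_integer, integer) keys would still have to be inserted
# one by one afterwards, so nothing is gained.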

Diez
 

bearophileHUGS

Roy Smith:
I am befuddled as to why
people thought creating a dict by keyword would be a useful thing to
do, but left out (and, indeed, eliminated the chance to add the syntax
later) the obviously useful ability to hint at the size.

Adding the keyword syntax doesn't cost the Python programmer much to
remember, because it's the same syntax used elsewhere, while adding yet
another method to dicts enlarges their API. Python is usually designed
to have smaller APIs, to help programmers remember most things.

I agree that a reserve method for Python dicts/sets could be a little
useful, but Python programmers are usually more concerned with doing
things in little programmer time and little code space than with telling
the computer how to do them in the fastest way (C++ is more focused on
speed, so the reserve of the STL hashes is more useful there).

Bye,
bearophile
 

John Machin

1. Why is two minutes to insert 5M keys "bad" for you? What would be
"good"? What would be good/bad look-up times? Have you measured the
typical look-up time? How often is the dict creation required to be
done? How often does the data change? Is multi-user access required for
(a) look-up (b) updating? Have you considered loading the dict from a
pickle?
2. Assuming your code that is creating the dict looks in essence like
this:
   adict = {}
   for k, v in some_iterable:
       adict[k] = v
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco? In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco -> Intel x86, but what
chip exactly? How much free memory?
4. Consider printing time-so-far results, say every 100K keys. Multiple
step-ups might indicate dict resizings. A dog-leg probably means
running out of memory. Why "roughly" 5M keys???
5. How large are your long_integers?
6. What is the nature of the value associated with each key?
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)? (See the sketch after this list.)
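
A sketch of the packed-key idea from point 7 (assuming the integer part b is
non-negative and fits in 32 bits):

SHIFT = 2 ** 32

def pack(a, b):
    # Combine (long_integer, integer) into a single integer key.
    return a * SHIFT + b

def unpack(key):
    return divmod(key, SHIFT)

d = {}
d[pack(123456789012345, 42)] = None
print(pack(123456789012345, 42) in d)     # True
print(unpack(pack(123456789012345, 42)))  # (123456789012345, 42)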

HTH,
John
 

Chris Foote

Roy said:
Are those clock times or CPU times?

User + system CPU time.
How much memory is your process using and how much is available on your
machine?

A very large amount of space is consumed, which is why I didn't give
times for the 10M version, which would have eaten my 1G of RAM and gone
into swap :)
I'm guessing an integer takes 4 bytes and a long integer takes roughly one
byte per two decimal digits. Plus a few more bytes to bundle them up into
a tuple. You've probably got something like 20 bytes per key, so 5M of
them is 100 meg just for the keys.

To get reasonable hashing performance, the hash table needs to be maybe
half full, and figure a hash key is 4 bytes, so that's another 40 meg for
the hash table itself.

Plus whatever the values stored in the dictionary take up. Even if you're
not storing any values, there's probably 4 bytes for a null pointer
(or a reference to None), so that's another 40 meg.

These are all vague guesses, based on what I think are probably
conservative estimates of what various objects must take up in memory, but
you see where this is going. We're already up to 180 meg.

Yep, that size sounds about right (my dictionary values are all None).
I wonder if the
whole problem is that you're just paging yourself to death. A few minutes
watching your system memory performance with ps or top while your program
is running might give you some idea if this is the case.

Definitely not.

Thanks anyway,

Chris
 

Chris Foote

Aahz said:
And what are these times measuring?

The loading of a file into a dictionary, i.e. no lookup operations.

Aahz also said:
Don't forget that adding keys requires resizing the dict, which is a
moderately expensive operation.

Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.
Once the dict is constructed, lookup times should be quite good.

Very good indeed with Psyco!

Cheers,
Chris
 

Maric Michaud

On Monday 15 May 2006 at 21:07, Diez B. Roggisch wrote:
That is a totally different beast. You don't want to insert arbitrary
keys, you want the internal hash-array to be of the right size.

But you can replace the xrange part by any generator function you want.

def get_mykeys(...):
    ...
    yield key

I just wonder whether the time-consuming part is the memory allocation of the
hash table (key by key) or the hash operation itself.

I don't see a use case where a Python programmer should need a
dictionary "that will probably be big" but can't predict what keys will be
in it.
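
For example (a sketch: get_mykeys here is a hypothetical generator producing
synthetic (long_integer, integer) keys; in real use it would yield the keys
read from the input file):

def get_mykeys(n):
    # Hypothetical stand-in for reading the real keys from the input file.
    scale = 2 ** 32
    for i in xrange(n):
        yield (i * scale, i)

d = {}.fromkeys(get_mykeys(1000))    # values all default to None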

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
 

Duncan Booth

Chris said:
Yep, that's why I probably need a dictionary where I can pre-specify
an approximate size at the time of its creation.

Try splitting the creation of the keys from the creation of the dictionary
and you'll see where the bottleneck really is.


For example, building the dictionary with dict.fromkeys from a pre-built list
of 5 million integer keys takes less than 1 second. So creating a dictionary
with 5 million elements is not in itself a terribly expensive operation
(and yes, even with fromkeys there is no attempt to work out in advance
what size the dictionary should be).

Trying to simulate your data:
scale = 2**32
data = [ (i*scale,i) for i in range(5000000) ]
d = dict.fromkeys(data)

the assignment to data took 42 seconds. Creating the dictionary took 6
seconds.

Trying the variation someone else suggested:
scale = 2**32
data = [ i*scale+i for i in range(5000000) ]
d = dict.fromkeys(data)

creating data took under 6 seconds and about 1 second for d.

so it looks like the issue is not creating a large dictionary, nor creating
lots of integers; it is creating lots of tuples that is expensive.

Oh, and if anyone tells you that the intermediate list I created in the
tests above is going to make it slow and you should use iterators instead:
r = range(5000000)
scale = 2**32
s = [ i*scale for i in r ]
from itertools import izip
d = dict.fromkeys(izip(s,r))

The first few lines took a couple of seconds; the last one took 1 minute 50
seconds. An alternative version takes a similar time.
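
A small harness for reproducing that kind of split timing on other machines
(a sketch; the absolute numbers will of course differ from the figures above):

import time

def timed(label, func):
    # Run func once and report wall-clock time for that step.
    t0 = time.time()
    result = func()
    print("%-22s %.1fs" % (label, time.time() - t0))
    return result

scale = 2 ** 32
r = range(5000000)

tuples = timed("build tuple keys", lambda: [(i * scale, i) for i in r])
timed("fromkeys (tuples)", lambda: dict.fromkeys(tuples))

ints = timed("build packed int keys", lambda: [i * scale + i for i in r])
timed("fromkeys (ints)", lambda: dict.fromkeys(ints))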
 

Chris Foote

Claudio said:
Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in the
standard Python 2.4 distribution.

However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created by
passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk may be
created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), and its
performance is unsatisfactory for my purposes:

# keys    dictionary    metakit    bsddb        (all using psyco)
------    ----------    -------    -----
1M              8.8s      22.2s    20m25s[3]
2M             24.0s      43.7s    N/A
5M            115.3s     105.4s    N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in string
form "long_integer:integer" since bsddb doesn't support keys that aren't
integers or strings.
 

Chris Foote

Paul said:
sqlite also supports an in-memory database - use pysqlite
(http://initd.org/tracker/pysqlite/wiki) to access this from Python.

Hi Paul.

I tried that, but the overhead of parsing SQL queries is too high:

              dictionary    metakit    sqlite[1]
              ----------    -------    ---------
1M numbers          8.8s      22.2s        89.6s
2M numbers         24.0s      43.7s       190.0s
5M numbers        115.3s     105.4s        N/A

Thanks for the suggestion, but no go.

Cheers,
Chris

[1] pysqlite V1 & sqlite V3.
 

John Machin

Chris Foote said:
(my dictionary values are all None).

So in fact all you need is a set. Have you experimented with the Python
2.5 alpha?
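
A sketch of the set-based version (only key presence matters, so there is no
need to carry None values around):

scale = 2 ** 32

# Synthetic stand-in for the real (long_integer, integer) keys.
keys = set((i * scale, i) for i in xrange(5000000))

print((3 * scale, 3) in keys)    # True  -> hit
print((3 * scale, 4) in keys)    # False -> miss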
 
