Large Dictionaries


Vladimir 'Yu' Stepanov

Chris said:
In browsing their excellent documentation, it seems that all concepts
are built around storing and reading HDF5 format files.

Not suitable for this project unfortunately.

Cheers,
Chris
How many values do the long and the int take in your pairs? It might be
possible to use a dictionary of dictionaries:

d = {}
for long_val, int_val in list_vals:
    d.setdefault(long_val, {})[int_val] = None
 

Duncan Booth

John said:
So in fact all you need is a set. Have you experimented with the Python
2.5 alpha?
I don't think that will help him: my timings work out about the same on
2.4.2 or 2.5a2. As I pointed out, the bottleneck is creating the tuples.
Creating either a dictionary or a set from 5 million tuples takes a similar
amount of time (about 6 seconds on my system).

BTW, a string key is quite a bit faster than the tuples, even though you
end up with as many temporary tuples during the process:
scale = 2**32
data = [ "%d:%d" % (i*scale,i) for i in range(5000000) ]
d = set(data)

takes about 18 seconds to create the data list and 5 for the set.
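
For anyone who wants to reproduce the comparison, a rough harness along these
lines (a reconstruction, not Duncan's original script; timings will of course
vary by machine) times the tuple-key and string-key variants side by side:

import time

scale = 2 ** 32
n = 5000000

start = time.time()
tuple_data = [(i * scale, i) for i in xrange(n)]          # tuple keys
tuple_set = set(tuple_data)
print "tuple keys:  %.1fs" % (time.time() - start)

start = time.time()
str_data = ["%d:%d" % (i * scale, i) for i in xrange(n)]  # string keys
str_set = set(str_data)
print "string keys: %.1fs" % (time.time() - start)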
 

Chris Foote

John said:
1. Why is two minutes to insert 5M keys "bad" for you?
What would be "good"?

Hi John.

I'd ideally like O(1) insertion cost per key, i.e. total time linear in the
number of keys inserted. If it were linear, 5M insertions would take the same
time as 5 x 1M keys, or 44.0s. I see more than double that:

# keys    dictionary    metakit
------    ----------    -------
1M        8.8s          22.2s
2M        24.0s         43.7s
5M        115.3s        105.4s

Maybe the dict resizing is the cause, in which case a dict
type which allows an approximate size to be pre-specified
would probably negate the need for resizing.
What would be good/bad look-up times? Have you measured the
typical look-up time?

The lookup time from all in-memory databases I've tried
is good (I need >15,000 lookups per second).
How often is the dict creation required to be
done?

At least once a day upon the process receiving a signal.
I was planning on using a separate thread to load a
temporary dictionary, and swap the old one for the new
one upon completion; that might be good in theory, but
I need a predictable load time over multiple machines,
which is what prompted me to ask about alternate hash
table libraries that are more suitable for large datasets.
How often does the data change?

Almost never, except during a full reload perhaps once
an hour through to once a day.
Is multi-user access required for
(a) look-up (b) updating?

Lookup - yes.
Updating - no.

Have you considered loading the dict from a pickle?

The pickling process consumes too much RAM to
save even a dict containing 1M keys (> 1 GB).
2. Assuming your code that is creating the dict looks in essence like
this:
adict = {}
for k, v in some_iterable:
    adict[k] = v

I'm using tuples as keys, with None for the data - it's roughly
like this:

class NumberDB(object):
    def __init__(self):
        self.numbers = {}

    def add(self, number, wildcard_digits):
        self.numbers[(number, wildcard_digits)] = None

where add() is called after converting a line of text from a file
into integers to provide as arguments, i.e. keys are tuples of the
form (long_integer, integer).
then any non-linear behaviour can only be in the actual CPython
insertion code. Psyco can't help you there. Psyco *may* help with the
linear part, *if* you have enough memory. What are the corresponding
times without Psyco?

Almost double, because I'm populating the dictionary from
a file containing strings which I have to convert.
In any case, if your code isn't (conceptually)
that simple, then try cutting away the cruft and measuring again.
3. Which version of Python? What OS? OK, psyco -> Intel x86, but what
chip exactly? How much free memory?

My devel machine, used for this testing is:

Python 2.4.2 (#1, Feb 12 2006, 03:59:46)
[GCC 4.1.0 20060210 (Red Hat 4.1.0-0.24)] on linux2
Fedora Core 5 i686
CPU: AMD Athlon(tm) 64 Processor 2800+
1G RAM.
4. Consider printing time-so-far results, say every 100K keys. Multiple
step-ups might indicate dict resizings. A dog-leg probably means
running out of memory.

See the attached GIF for this. The X axis represents 100k-key increments
and the Y axis the wall-clock duration from startup.
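
A minimal sketch of the progress-timing John suggests in point 4
(some_iterable here is a synthetic stand-in for the real data source, not
part of the original code):

import time

def some_iterable(n=1000000):
    # synthetic stand-in for the real (key, value) source
    for i in xrange(n):
        yield (i * 2 ** 32 + i, None)

adict = {}
start = time.time()
for count, (k, v) in enumerate(some_iterable()):
    adict[k] = v
    if count % 100000 == 0:
        # a sudden jump between successive lines hints at a dict resize
        print count, '%.1fs' % (time.time() - start)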
Why "roughly" 5M keys???

Sorry, that was just a guess :)
5. How large are your long_integers?

They're phone numbers (test samples are 11 digits).
6. What is the nature of the value associated with each key?

Not used, so they're set to None.
7. Have you experimented with key = a * 2 ** 32 + b instead of key =
(a, b)?

No I haven't. I'll give that a try and report back, especially after
Duncan's posts suspecting tuple overhead.
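
For reference, a sketch of the packed-key idea from point 7 (names and
numbers here are illustrative only, not from Chris's code; it assumes
wildcard_digits fits comfortably in 32 bits):

SCALE = 2 ** 32

def pack_key(number, wildcard_digits):
    # single long integer instead of a (number, wildcard_digits) tuple
    return number * SCALE + wildcard_digits

numbers = {}
numbers[pack_key(61880000000, 3)] = None
print pack_key(61880000000, 3) in numbers   # True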

Cheers,
Chris
 

Claudio Grondi

Chris said:
Claudio said:
Python Bindings (\Python24\Lib\bsddb vers. 4.3.0) and the DLL for
BerkeleyDB (\Python24\DLLs\_bsddb.pyd vers. 4.2.52) are included in
the standard Python 2.4 distribution.


However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created by
passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk
may be created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys    dictionary    metakit    bsddb (all using psyco)
------    ----------    -------    -----
1M        8.8s          22.2s      20m25s[3]
2M        24.0s         43.7s      N/A
5M        115.3s        105.4s     N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in string
form "long_integer:integer" since bsddb doesn't support keys that aren't
integers or strings.
I have to admit that I haven't written any code of my own to actually test
this, but if 20m25s for storing a single MByte of strings in a database
table index column is really what you are getting, I can't get rid of the
feeling that there is something fundamentally wrong with your way of doing it.

Posting the code for your test cases appears to me to be the only way to
see the reason for the mystery you are getting here (this will also clarify
the other mysterious things considered by the posters in this thread so far).

Claudio
 

Duncan Booth

BTW why are python dicts implemented as hash tables and not judy
arrays?
Probably because:

Python predates judy arrays.
Nobody has proposed replacing the current dict implementation.
Nobody has demonstrated that it would actually be an advantage to
replace the current implementation.
Or maybe Python dicts are just more flexible?

You choose.

A quick scan of the Judy IV Shop Manual reveals:
Judy is a dessert topping and a floor wax, but it’s not intended for
use as a house paint. ....
Judy is a programming library that provides a relatively simple
interface (API) for array-like storage of word or string indexes with
optional mapping of indexes to single-word values.

It doesn't sound to me as though they match the flexibility of Python
dictionaries. Without reading further through the documentation, the
implication is that keys are restricted to simple word or string values.
Remember, all that Python requires of a dict key is that it has a hash
value and can be compared for equal/not equal with other keys. It
doesn't require the key to be expressible as a sequence of bits, which is
what the Judy manual seems to imply.
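
To make the contrast concrete, a small illustration (not from Duncan's post
or the Judy documentation) of how little Python asks of a dict key - just
__hash__ and __eq__:

class PhoneKey(object):
    '''Arbitrary object used as a dict key; only needs __hash__ and __eq__.'''
    def __init__(self, number, wildcard_digits):
        self.number = number
        self.wildcard_digits = wildcard_digits
    def __hash__(self):
        return hash((self.number, self.wildcard_digits))
    def __eq__(self, other):
        return (self.number, self.wildcard_digits) == \
               (other.number, other.wildcard_digits)

d = {PhoneKey(61880000000, 3): None}
print PhoneKey(61880000000, 3) in d   # True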
 

Duncan Booth

Duncan said:
Probably because:

Python predates judy arrays.
Nobody has proposed replacing the current dict implementation.
Nobody has demonstrated that it would actually be an advantage to
replace the current implementation.
Or maybe Python dicts are just more flexible?

Also, see http://www.dalkescientific.com/Python/PyJudy.html with the
conclusions drawn on performance:
The first conclusion is that Python dictionaries are wicked fast.
There are few cases where PyJudy is faster, though perhaps there might
be a few more if I knew more about the Python extension API. Bear in
mind though that the Judy arrays are in sorted order and the JudyL*
classes have ways to access elements by order.

So for a specialised use Judy arrays could be a good idea, but not as a
general replacement for Python dicts.
 

Sion Arrowsmith

Maric Michaud said:
I don't see a use case where a python programmer should need a
dictionary "that will be probably big" but can't predict what keys will be
in it.

I've recently been working on an app with precisely this
characteristic, although there were enough other bottlenecks that the
cost of growing the dict as needed was irrelevant. (Read ~2m items
from csv file, compare them with items in a database and populate
assorted dictionaries based on the result of that comparison for
subsequent processing. You can make a reasonable guess, and have a
definite upper bound, on the sizes of the dictionaries cheaply right
at the start, but you don't know what the keys in each are going to be
until you discover them through the comparison of file and DB.)
 

Chris Foote

Claudio said:
Chris said:
However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created
by passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk
may be created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys    dictionary    metakit    bsddb (all using psyco)
------    ----------    -------    -----
1M        8.8s          22.2s      20m25s[3]
2M        24.0s         43.7s      N/A
5M        115.3s        105.4s     N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in
string form "long_integer:integer" since bsddb doesn't support keys
that aren't integers or strings.
I have to admit that I haven't written any code of my own to actually test
this, but if 20m25s for storing a single MByte of strings in a database
table index column is really what you are getting, I can't get rid of the
feeling that there is something fundamentally wrong with your way of doing it.

Hi Claudio.

1M is one million, referring to the number of insertions of keys;
not a Megabyte. I'm sorry that you took it that way :-(

Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.
Posting the code for your test cases appears to me to be the only way to
see the reason for the mystery you are getting here (this will also clarify
the other mysterious things considered by the posters in this thread so far).

I agree that posting some test code would have proved useful, but
the code is big and has too many interdependencies on external
things (e.g. databases, threads & pyro RPC calls) to allow me
to separate out examples easily. But if you go back to my
original posting, I think my question was quite clear.

Best regards,
Chris
 

Claudio Grondi

Chris said:
Claudio said:
Chris said:
However, please note that the Python bsddb module doesn't support
in-memory based databases - note the library documentation's[1] wording:

"Files never intended to be preserved on disk may be created
by passing None as the filename."

which closely mirrors the Sleepycat documentation[2]:

"In-memory databases never intended to be preserved on disk
may be created by setting the file parameter to NULL."

It does actually use a temporary file (in /var/tmp), for which
performance for my purposes is unsatisfactory:

# keys    dictionary    metakit    bsddb (all using psyco)
------    ----------    -------    -----
1M        8.8s          22.2s      20m25s[3]
2M        24.0s         43.7s      N/A
5M        115.3s        105.4s     N/A

Cheers,
Chris

[1] bsddb docs:
http://www.python.org/doc/current/lib/module-bsddb.html

[2] Sleepycat BerkeleyDB C API:
http://www.sleepycat.com/docs/api_c/db_open.html

[3] Wall clock time. Storing the (long_integer, integer) key in
string form "long_integer:integer" since bsddb doesn't support keys
that aren't integers or strings.
I have to admit that I haven't written any code of my own to actually test
this, but if 20m25s for storing a single MByte of strings in a database
table index column is really what you are getting, I can't get rid of the
feeling that there is something fundamentally wrong with your way of doing it.


Hi Claudio.

1M is one million, referring to the number of insertions of keys;
not a Megabyte. I'm sorry that you took it that way :-(

Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.
Posting the code for your test cases appears to me to be the only way to
see the reason for the mystery you are getting here (this will also clarify
the other mysterious things considered by the posters in this thread so far).


I agree that posting some test code would have proved useful, but
the code is big and has too many interdependencies on external
things (e.g. databases, threads & pyro RPC calls) to allow me
to separate out examples easily. But if you go back to my
original posting, I think my question was quite clear.

Best regards,
Chris
I have some code demonstrating the mystery from my point of view (including
timings which differ from what you experience by an order of magnitude, on
my 2.8 GHz P4):

dctA = {}
import time
strt = time.clock()
lst = range(123456789010000000L, 123456789010000000L + 5000000)
import random
endt = time.clock()
print "lst = range(12345678901L, 12345678901L + 5000000) : ", endt - strt
strt = time.clock()
random.shuffle(lst)
endt = time.clock()
print "random.shuffle(lst) : ", endt - strt
random.shuffle(lst)
counter = 0
for item in lst:
    if (counter == 0):
        print "Listing of some of lst items:"
    if (counter > 4):
        print ":END of listing of lst items"
        break
    else:
        print "item no. %i "%(counter,), repr(item)
    counter += 1
#:for
# raw_input('continue with ENTER) #>')
strt = time.clock()
dctA.fromkeys(lst, None)
endt = time.clock()
print "dctA.fromkeys(lst, None) : ", endt - strt
# raw_input('continue with ENTER) #>')
strt = time.clock()
strLst = []
for item in lst: strLst.append(str(item))
dctA = {}
endt = time.clock()
print "for item in lst: strLst.append(str(item)) : ", endt - strt
# raw_input('continue with ENTER) #>')
strt = time.clock()
dctA = dctA.fromkeys(strLst, None)
endt = time.clock()
print "dctA.fromkeys(strLst, None) : ", endt - strt
# raw_input('continue with ENTER) #>')
print "len(dctA) : %i " % (len(dctA),)
counter = 0
for key in dctA.keys():
    if (counter == 0):
        print "Listing of some of dctA items:"
    if (counter > 4):
        print ":END of listing of dctA items"
        break
    else:
        print "key no. %i "%(counter,), repr(key)
    counter += 1
#:for
raw_input('exit with ENTER) #>')

# Gives as output :
#
# lst = range(12345678901L, 12345678901L + 5000000) : 0.81327347470
# random.shuffle(lst) : 8.06178829991
# Listing of some of lst items:
# item no. 0 123456789011010733L
# item no. 1 123456789013585761L
# item no. 2 123456789013610266L
# item no. 3 123456789011311029L
# item no. 4 123456789010968853L
# :END of listing of lst items
# dctA.fromkeys(lst, None) : 3.11773256098
# for item in lst: strLst.append(str(item)) : 11.5454232312
# dctA.fromkeys(strLst, None) : 3.98027849908
# len(dctA) : 5000000
# Listing of some of dctA items:
# key no. 0 '123456789014515319'
# key no. 1 '123456789014116699'
# key no. 2 '123456789014116698'
# key no. 3 '123456789010800915'
# key no. 4 '123456789014116696'
# :END of listing of dctA items

So the insertion into the dictionary is, according to you:

# keys    dictionary
5M        115.3s

and according to me:

5M        3.9s

According to the task manager, I need about 0.5 GByte of memory for this
(including the lists I keep alive).

I conclude from that that your results with the Berkeley DB approach are also
a case of an unfortunate way of doing things (in Python and with the DB), and
that using Python dictionaries is the simplest way to go here (I suppose that
your main problem is that you are not using .fromkeys() on a list of your
input data).

Claudio
 

Klaas

22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

With bdb's it is crucial to insert keys in bytestring-sorted order.
Also, be sure to give it a decent amount of cache.

-Mike
 

Chris Foote

Klaas said:
22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:

Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration 665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s

test code is attached.
With bdb's it is crucial to insert keys in bytestring-sorted order.

For the bsddb test, I'm using a plain string. (The module docs list a
string being the only datatype supported for both keys & values).
Also, be sure to give it a decent amount of cache.

The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

....
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize: method not permitted when environment specified')


I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris
 

Claudio Grondi

Chris said:
Klaas said:
22.2s 20m25s[3]


20m to insert 1m keys? You are doing something wrong.


Hi Mike.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration
104.3s
It is surprising that the dictionary population takes about the same time,
but BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2, and you?
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:18:17 2006 dictionary population started
Wed May 17 22:18:26 2006 dictionary population stopped, duration 8.6s
Wed May 17 22:18:27 2006 StorageBerkeleyDB population started
Wed May 17 22:29:32 2006 StorageBerkeleyDB population stopped, duration
665.6s
Wed May 17 22:29:33 2006 StorageSQLite population started
Wed May 17 22:30:38 2006 StorageSQLite population stopped, duration 65.5s
As I don't have SQLite installed, it would be interesting to see whether the
factor-of-10 speed difference between BerkeleyDB and SQLite can be confirmed
by someone else.
Why is SQLite faster here? I suppose that SQLite first adds all the records
and builds the index afterwards, once all the records are there (with
db.commit()).
Can the same be done in BerkeleyDB, or does BerkeleyDB not support inserting
records without updating the index on every single insert? If so, how?

Claudio
test code is attached.
With bdb's it is crucial to insert keys in bytestring-sorted order.


For the bsddb test, I'm using a plain string. (The module docs list a
string being the only datatype supported for both keys & values).
Also, be sure to give it a decent amount of cache.


The bsddb.hashopen() factory seems to have a bug in this regard; if you
supply a cachesize argument, then it barfs:

....
  File "bsddb-test.py", line 67, in runtest
    db = bsddb.hashopen(None, flag='c', cachesize=8192)
  File "/usr/lib/python2.4/bsddb/__init__.py", line 288, in hashopen
    if cachesize is not None: d.set_cachesize(0, cachesize)
bsddb._db.DBInvalidArgError: (22, 'Invalid argument -- DB->set_cachesize: method not permitted when environment specified')


I'll file a bug report on this if it isn't already fixed.

Cheers,
Chris


------------------------------------------------------------------------

import bsddb
import random
import time
import sqlite

import psyco
psyco.full()

class WallClockTimer(object):
    '''Simple timer for measuring tests.'''
    def __init__(self, msg):
        self.start = time.time()
        self.msg = msg
        print time.ctime(self.start), self.msg, 'started'

    def stop(self):
        end = time.time()
        duration = end - self.start
        print time.ctime(end), self.msg, 'stopped, duration %.1fs' % duration

class NumberGen(object):
    '''Phone number generator for testing different methods of storage.'''
    def __init__(self, range_start, range_end, n, max_wildcards):

        print '''Number generator test for %d number ranges
with a maximum of %d wildcard digits.''' % (n, max_wildcards)

        wildcards = range(0, max_wildcards + 1)
        # generate n unique numbers and store as keys in number_hash
        timer = WallClockTimer('dictionary population')
        self.number_hash = {}

        for i in xrange(0, n):
            unique = False
            while not unique:
                wildcard_digits = self.gen_wildcard_digits(wildcards)
                num = self.gen_number(range_start, range_end)
                if wildcard_digits > 0:
                    num /= (10 ** wildcard_digits)
                key = (num, wildcard_digits)
                if self.number_hash.has_key(key):
                    unique = False
                else:
                    unique = True
                    self.number_hash[key] = None
        timer.stop()

    def gen_wildcard_digits(self, wildcards):
        return random.choice(wildcards)

    def gen_number(self, start, end):
        return int(random.uniform(start, end))

    def storage_test(self, StorageTestClass):
        test = StorageTestClass(self.number_hash)

class StorageTest(object):
    '''base class for testing storage.  Derive a test
    class and provide your own runtest() method.'''
    def __init__(self, number_hash):
        timer = WallClockTimer('%s population' % type(self).__name__)
        self.runtest(number_hash)
        timer.stop()

class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

class StorageSQLite(StorageTest):
    def runtest(self, number_hash):
        db = sqlite.connect(':memory:')
        cursor = db.cursor()
        cursor.execute('CREATE TABLE numbers (num INTEGER, wd INTEGER)')
        q = """insert into numbers (num, wd) values (%d, %d)"""
        for (num, wildcard_digits) in number_hash.keys():
            cursor.execute(q, num, wildcard_digits)
        db.commit()
        db.close()


if __name__ == '__main__':

    test_vals = {'range_start'   : 61880 * 10 ** 7,
                 'range_end'     : 61891 * 10 ** 7,
                 'n'             : 1 * 10 ** 4,
                 'max_wildcards' : 3}

    ngen = NumberGen(test_vals['range_start'], test_vals['range_end'],
                     test_vals['n'], test_vals['max_wildcards'])

    ngen.storage_test(StorageBerkeleyDB)
    ngen.storage_test(StorageSQLite)
 

Dennis Lee Bieber

As I don't have SQLite installed, it is interesting to see if the factor

Who does... pysqlite2 is the current module to my knowledge; I had
to edit the code to replace sqlite with pysqlite2... I don't have psyco
so that had to be commented out... And the supplied code only ran for
10,000 so that had to change too...

With the above changes:

E:\UserData\Dennis Lee Bieber\My Documents\Python Progs>python
bsddb-test.py
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 21:05:48 2006 dictionary population started
Wed May 17 21:06:05 2006 dictionary population stopped, duration 16.8s
Wed May 17 21:06:05 2006 StorageBerkeleyDB population started
Wed May 17 21:07:16 2006 StorageBerkeleyDB population stopped, duration
70.7s
Wed May 17 21:07:16 2006 StorageSQLite population started
Wed May 17 21:07:37 2006 StorageSQLite population stopped, duration
21.5s

E:\UserData\Dennis Lee Bieber\My Documents\Python Progs>

Remember, the above is without psyco compilation!

WinXP Pro SP2
3.4GHz P4 HT
2GB RAM
(Active)Python 2.3.5

Task manager showed 50% CPU; commit charge grew from 380MB to 475MB
during phase 1, stayed flat for phase 2, and hit 491MB just before phase
3 ended.

With the as-given test range, the timings are almost immeasurable:

E:\UserData\Dennis Lee Bieber\My Documents\Python Progs>python
bsddb-test.py
Number generator test for 10000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 21:12:35 2006 dictionary population started
Wed May 17 21:12:35 2006 dictionary population stopped, duration 0.1s
Wed May 17 21:12:35 2006 StorageBerkeleyDB population started
Wed May 17 21:12:35 2006 StorageBerkeleyDB population stopped, duration
0.4s
Wed May 17 21:12:35 2006 StorageSQLite population started
Wed May 17 21:12:35 2006 StorageSQLite population stopped, duration 0.2s

--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Chris Foote

Richie said:
[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?

PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."

Thanks for the suggestion Richie. PyJudy works brilliantly :)

Cheers,
Chris
 

Chris Foote

Claudio said:
Chris said:
Klaas said:
22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped, duration
104.3s
It is surprising that the dictionary population takes about the same time,
but BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2, and you?

Fedora core 5 with ext3 filesystem. The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).
As I don't have SQLite installed, it is interesting to see if the factor
10 in the speed difference between BerkeleyDB and SQLite can be
confirmed by someone else.
Why is SQLite faster here? I suppose, that SQLite first adds all the
records and builds the index afterwards with all the records there (with
db.commit()).

SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.

Cheers,
Chris
 

Claudio Grondi

Chris said:
Claudio said:
Chris said:
Klaas wrote:

22.2s 20m25s[3]

20m to insert 1m keys? You are doing something wrong.

I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:
I have run your code for the bsddb on my P4 2.8 GHz and have got:
Number generator test for 1000000 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 16:34:06 2006 dictionary population started
Wed May 17 16:34:14 2006 dictionary population stopped, duration 8.4s
Wed May 17 16:34:14 2006 StorageBerkeleyDB population started
Wed May 17 16:35:59 2006 StorageBerkeleyDB population stopped,
duration 104.3s

It is surprising that the dictionary population takes about the same time,
but BerkeleyDB inserts the records 6 times faster on my computer than on
yours. I am running Python 2.4.2 on Windows XP SP2, and you?

Fedora core 5 with ext3 filesystem. The difference will be due to
the way that Windows buffers writes for the filesystem you're using
(it sounds like you're using a FAT-based file system).
OK, according to the Windows task manager, the Python process reads/writes
around 7 GByte(!) of data to the file system during the BerkeleyDB test run,
and the hard drive is continuously busy, while the size of the file I found
in the Temp directory always stays below 20 MByte. The hard drive access is
probably the main reason for losing time. Here a question to BerkeleyDB
experts:

Can BerkeleyDB, via the Python bsddb3 interface, be tuned to use only RAM,
or, since BerkeleyDB can scale to larger amounts of data, does it not make
much sense to tweak it into RAM?

Chris, would a RAM disk maybe be the right way to go here, to save the time
lost accessing the file stored in the file system on the hard drive?

The RAM requirements, according to the Windows XP task manager, are below
100 MByte. I am using the NTFS file system (yes, I know that FAT is in some
configurations faster than NTFS) and XP Professional SP2 without any tuning
of file system caching. The CPU is 100% busy.

What CPU and RAM (SIMM, DDR, DDR2) do you have? I have 2 GByte of fast DDR
PC400/3200 dual-channel RAM. It seems that you are still not getting results
within the range others experience running your code, so I suppose it has
something to do with the hardware you are using.
SQLite is way faster because BerkeleyDB always uses a disk file,
and SQLite is in RAM only.
One of the reasons I have my eye on BerkeleyDB is that it claims to scale to
a huge amount (terabytes) of data, doesn't need as much RAM as a Python
dictionary, and doesn't require saving/loading a pickled version of the data
(i.e. here the dictionary) in order to work with it.
I guess that in your case BerkeleyDB is, for the named reasons, probably the
right way to go, unless your data will stay small and the Python dictionary
holding it will always fit into RAM.

Now I am curious to know which path you have decided to take, and why?

Claudio
 

Claudio Grondi

Chris said:
Richie said:
[Chris]
Has anyone written a fast hash module which is more optimal for
large datasets ?

PyJudy might be what you're looking for, though I've never used it:

http://www.dalkescientific.com/Python/PyJudy.html

"Judy's key benefits are scalability, high performance, and memory
efficiency. A Judy array is extensible and can scale up to a very large
number of elements, bounded only by machine memory." ... "PyJudy arrays
are similar to Python dictionaries and sets."

Thanks for the suggestion Richie. PyJudy works brilliantly :)

Cheers,
Chris
It seems that there is no Microsoft Windows version of Judy available, so
PyJudy won't work on Windows - am I right?

Claudio
 

Klaas

Chris:
Berkeley DB is great for accessing data by key for things already
stored on disk (i.e. read access), but write performance for each
key-value pair is slow due to it being careful about flushing
writes to disk by default.

This is absolutely false.

-Mike
 

Klaas

Chris:
class StorageBerkeleyDB(StorageTest):
    def runtest(self, number_hash):
        db = bsddb.hashopen(None, flag='c', cachesize=8192)
        for (num, wildcard_digits) in number_hash.keys():
            key = '%d:%d' % (num, wildcard_digits)
            db[key] = None
        db.close()

BDBs can accomplish what you're looking to do, but they need to be
tuned carefully. I won't get into too many details here, but you have
a few fatal flaws in that code.

1. 8Kb of cache is _pathetic_. Give it a few hundred megs. This is by
far your biggest problem.
2. Use BTREE unless you have a good reason to use DBHASH
3. Use proper bdb env creation instead of the hash_open apis.
4. Insert your keys in sorted order.

-Mike
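
A rough sketch of the kind of setup Mike describes, using the lower-level
bsddb.db API (the cache size, environment directory, sample keys and flags
here are illustrative choices, not a tested configuration):

import os
from bsddb import db

env_dir = '/tmp/bdb-env'
if not os.path.isdir(env_dir):
    os.makedirs(env_dir)

env = db.DBEnv()
env.set_cachesize(0, 256 * 1024 * 1024)       # a few hundred MB, not 8 KB
env.open(env_dir, db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_PRIVATE)

d = db.DB(env)
d.open('numbers.db', None, db.DB_BTREE, db.DB_CREATE)   # BTREE, not HASH

# toy data; insert keys in bytestring-sorted order
keys = sorted('%d:%d' % (num, wd)
              for (num, wd) in [(61880000001, 0), (61880000000, 3)])
for key in keys:
    d.put(key, '')

d.close()
env.close()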
 
