A question on Python dict

could.net

Python dict is a hash table, isn't it?

I know that hash tables have the concept of "bucket size" and "minimum
bucket count", and that these should be configurable so I can set them
to proper values when I know how many items I'm going to handle.

If these two values can't be set, the hash table will give them default
values. As more and more items are added to the hash table, it
increases its bucket count, copies the old items into the new buckets,
and re-calculates the hash value of each item.

I think this will waste some time on the growing and copying. If I know
I'm going to handle about 20000 items, I can set the size of the hash
table to 30000.

So, can I do this in Python? I can't figure it out, so I came here for
help.

Thanks!
 
Tim Peters

[[email protected]]
> Python dict is a hash table, isn't it?
Yup.

> I know that hash tables have the concept of "bucket size" and "minimum
> bucket count",

Some implementations of hash tables do. Python's does not. Python's
uses what's called "open addressing" instead.
> and that these should be configurable so I can set them to proper
> values when I know how many items I'm going to handle.

> If these two values can't be set, the hash table will give them default
> values. As more and more items are added to the hash table, it
> increases its bucket count, copies the old items into the new buckets,
> and re-calculates the hash value of each item.

That's several distinct claims, each of which is true of some hash
table implementations but not of others. Python's implementation has
no "buckets", so all that's simply irrelevant. Python's
implementation (not all do) saves the hash value of each item, so when
it does need to grow (or shrink) the space allocated to the table, it
does not need to recalculate the hash value of any item.
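
(A toy sketch of that idea follows; this is not CPython's actual code, and
the linear probing used here is simplified, but it shows how a table that
stores each entry's hash can grow without ever calling hash() again.)

# Toy open addressing: entries are (hash, key, value) triples in a flat
# table; a resize re-probes using the *stored* hash, so hash() is never
# called again. CPython's real probe sequence is different.
def insert(slots, h, key, value):
    i = h % len(slots)
    while slots[i] is not None and slots[i][1] != key:
        i = (i + 1) % len(slots)
    slots[i] = (h, key, value)

def resize(slots, new_size):
    new = [None] * new_size
    for entry in slots:
        if entry is not None:
            h, key, value = entry            # reuse the saved hash value
            insert(new, h, key, value)
    return new

table = [None] * 8
for word in ["spam", "eggs", "ham"]:
    insert(table, hash(word), word, len(word))
table = resize(table, 32)                    # grows without re-hashing anything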

You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
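
(You can see the effect of this from Python code: only a reference moves,
never the object itself.)

# The dict stores a reference to the object, so putting a large value into
# a dict (or moving it during a resize) never duplicates the object.
big = list(range(1_000_000))
d = {"data": big}
assert d["data"] is big      # same object; only the reference was "copied"
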
> I think this will waste some time on the growing and copying.

It does take time to resize. "Waste" is a value judgment ;-)
> If I know I'm going to handle about 20000 items, I can set the size of
> the hash table to 30000.

> So, can I do this in Python?

You can't. Unless you build up to 20000 items over and over (many
thousands of times), it's unlikely you'd be able to measure the time
difference even if you could.
 
could.net

Thank you very much for your explanation!

I made a mistake when I said the hash value has to be recalculated each
time the dict resizes; what I meant to say was that the position of
each item has to be recalculated.

Maybe I should take a look at the source code of the Python dict some day.
I'll come here for help then.

Thanks again!


 
Duncan Booth

> Thank you very much for your explanation!
>
> I made a mistake when I said the hash value has to be recalculated each
> time the dict resizes; what I meant to say was that the position of
> each item has to be recalculated.
>
> Maybe I should take a look at the source code of the Python dict some day.
> I'll come here for help then.

In your original message you said:
> I think this will waste some time on the growing and copying. If I know
> I'm going to handle about 20000 items, I can set the size of the hash
> table to 30000.

Have you actually tried timing how much time is wasted doing this? 20000
items really isn't very many for a dictionary to handle.

I ran a quick test to see how long it takes to create a dictionary with
20000 items. Remember, as has been explained to you, the items are not
copied, nor are the hash values recalculated when the dictionary is
resized, so if your dictionary takes much longer to create than the test
below does (on your machine, that is, rather than on mine), then the
problem lies elsewhere: e.g. a slow hash function, or the time taken to
create whatever you are storing in the dictionary.

Just creating the dictionary with no other Python overhead:

timeit.py -s "import random; r = [ random.random() for i in range(20000)]"
"d = dict.fromkeys(r)"
100 loops, best of 3: 11.3 msec per loop

Creating the dictionary using a Python for loop:

timeit.py -s "import random; r = [ random.random() for i in range(20000)]"
"d={}" "for k in r: d[k] = None"
100 loops, best of 3: 11.7 msec per loop

Assigning to the elements in a pre-populated (and therefore the right
size) dictionary:

timeit.py -s "import random; r = [ random.random() for i in range(20000)];
d=dict.fromkeys(r)" "for k in r: d[k] = None"
100 loops, best of 3: 10.1 msec per loop

How many times are you creating this dictionary, that shaving 1.5
milliseconds off 11 milliseconds matters to you?

FWIW, for a 2,000,000 element dictionary the times are 2.5s and 1.48s
respectively so the situation does get progressively worse as dictionaries
get very large.
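
(If you want to reproduce these measurements without the timeit.py command
line, something like the following script should give comparable numbers;
the exact figures will of course vary by machine and Python version.)

# Rough equivalent of the command-line timings above, using the timeit module.
import random
import timeit

r = [random.random() for _ in range(20000)]

def build_fromkeys():
    return dict.fromkeys(r)

def build_loop():
    d = {}
    for k in r:
        d[k] = None
    return d

for fn in (build_fromkeys, build_loop):
    per_loop = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print("%s: %.1f msec per loop" % (fn.__name__, per_loop * 1000))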
 
Lawrence D'Oliveiro

> You should also note that copying a dict key or value (no matter of
> what type) consists in its entirety of copying one machine address (a
> 4- or 8-byte pointer, depending on platform).

Actually, no. It also consists of updating reference counts.
 
Tim Peters

[Tim Peters]
> You should also note that copying a dict key or value (no matter of
> what type) consists in its entirety of copying one machine address (a
> 4- or 8-byte pointer, depending on platform).

[Lawrence D'Oliveiro]
> Actually, no. It also consists of updating reference counts.

Not in context: dict resizing is refcount-neutral, and the CPython
implementation of dict resizing exploits that (quite directly in 2.5;
indirectly before 2.5). It's not just that refcounts end up the same
/across/ dict resizing, it's that they're not changed internally either for
the duration of the resizing operation.
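
(A crude way to observe the net effect from Python code, on CPython, is to
check sys.getrefcount before and after forcing some resizes; this only shows
that the counts end up unchanged, not the stronger point that they are never
touched during the resize, which you can see in CPython's dictobject.c.)

# CPython-specific illustration: resizing the dict leaves the refcounts of
# existing keys and values unchanged.
import sys

val = object()
d = {"key": val}
before = sys.getrefcount(val)
for i in range(100000):          # adding unrelated keys forces several resizes
    d[i] = None
after = sys.getrefcount(val)
assert before == after           # val's refcount is the same after the resizes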
 
Lawrence D'Oliveiro

[Lawrence D'Oliveiro]
>> Actually, no. It also consists of updating reference counts.

[Tim Peters]
> Not in context: dict resizing is refcount-neutral...

The question wasn't about resizing, it was about "copying a dict key or
value".
 
Tim Peters

[Lawrence D'Oliveiro]
> The question wasn't about resizing, it was about "copying a dict key or
> value".

Then you misunderstood the OP. He was asking about dict resizing:

...

> As more and more items are added to the hash table, it increases its
> bucket count, copies the old items into the new buckets, and
> re-calculates the hash value of each item.
>
> I think this will waste some time on the growing and copying.

Taking my response out of context to begin with doesn't really change that
I answered the question he asked ;-)
 
Lawrence D'Oliveiro

[Lawrence D'Oliveiro]
>> The question wasn't about resizing, it was about "copying a dict key or
>> value".

[Tim Peters]
> Then you misunderstood the OP. He was asking about dict resizing:

Which is not what you were talking about, in the part of your posting that I
was responding to (above).
 
Fredrik Lundh

Tim said:
> Taking my response out of context to begin with doesn't really change that
> I answered the question he asked ;-)

welcome to comp.lang.python.

</F>
 
