Adding tuples to a dictionary

Guest

Hi Pythonistas!

I've got a question about storing tuples in a dictionary. First, a
small test case which creates a list of dictionaries:

import time

list_of_dicts = []
keys = [str(x) for x in range(200000)]
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = key
        list_of_dicts.append(my_dict)   # note: this append sits in the inner loop
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk

It creates dictionaries and stores them in a list, printing out
execution times. The size of each dictionary is constant, and so is
the execution time for each iteration.

0 0.1
1 0.1
2 0.1
3 0.08
4 0.09

...and so on.

Then, just one line is changed:
my_dict[key] = key
into:
my_dict[key] = (key, key)

Full code:

import time

list_of_dicts = []
keys = [str(x) for x in range(200000)]
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
        list_of_dicts.append(my_dict)   # note: this append sits in the inner loop
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk

The only difference is that tuples, rather than single values, are
stored in the dictionary. When the program is run again:

0 0.27
1 0.37
2 0.49
3 0.6
....
16 2.32
17 2.45
18 2.54
19 2.68

The execution time rises linearly with every new dictionary created.

Next experiment: the dictionaries are not stored in a list; they are
simply discarded when an iteration finishes. This is done by removing
two lines:

list_of_dicts = []

and

list_of_dicts.append(my_dict)

Full code:

import time

keys = [str(x) for x in range(200000)]
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk

The time is constant again:

0 0.28
1 0.28
2 0.28
3 0.26
4 0.26

I can see no obvious reason for this performance problem. It happens
only when both things are true: the dictionaries are kept in a list
(or, more generally, kept alive in memory) and they store tuples.

As this goes beyond my understanding of Python internals, I would
kindly ask whether anyone has an idea how to create this data
structure (a list of dictionaries of tuples, all dictionaries being
the same size) in constant time per dictionary?

Regards,
Maciej
 
Peter Otten

Maciej said:
[original post quoted in full; snipped]

Disable garbage collection:

import gc
gc.disable()

The cyclic garbage collector uses inappropriate heuristics in this
case: it triggers a collection whenever allocations outnumber
deallocations by a threshold, and a full pass has to examine every
tracked container object. Since all of your tuples stay alive, each
pass has more objects to look at than the last.
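
You can confirm the diagnosis by timing the collector directly
(a sketch, untested; gc.collect() forces a full pass, and its cost
should grow with each iteration as more live tuples accumulate):

import gc
import time

list_of_dicts = []
keys = [str(x) for x in range(200000)]
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
    list_of_dicts.append(my_dict)
    t0 = time.clock()
    gc.collect()              # force a full pass and time it in isolation
    t1 = time.clock()
    print i, t1 - t0          # grows as more live tuples must be examined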

Peter
 
bvukov

[original post quoted in full; snipped]

Let me comment on what happens in your code.

The places where you create new objects are:

    keys = [str(x) for x in range(200000)]

which creates 200000 strings that are later reused by reference, and

    my_dict[key] = (key, key)

which creates a new two-element tuple (both elements are just
references to the existing key object).

The tricky part is where you wrote:

    for key in keys:
        my_dict[key] = (key, key)
        list_of_dicts.append(my_dict)   # note: the append is inside
                                        # the inner loop!

This means the my_dict reference is stored 200000 times per outer
iteration, and the dictionary won't be released. The statement

    my_dict = {}

always creates a brand-new dictionary (20 iterations means 20 new
dictionaries) and starts over. Python does cache freed dictionaries
(they are used everywhere internally), but since yours are never
freed, no reuse can happen and memory has to be allocated over and
over.

Lists are internally like arrays: when there is no room for the next
element, the pointer array is grown geometrically, so there is no huge
penalty in the append function. Lists are also reused from an internal
cache. Dictionaries have a similar growth strategy: when there is no
room for the next key, the internal hash table is resized upward.

The growing times come from the fact that memory keeps being allocated
instead of existing objects being reused by reference. Check the
memory usage, and you'll see that the test time is pretty much
proportional to overall memory usage.
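
You can watch the object population grow with something like this
(a sketch; len(gc.get_objects()) counts the container objects the
interpreter is currently tracking, so the exact figures will vary):

import gc

list_of_dicts = []
keys = [str(x) for x in range(200000)]
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
    list_of_dicts.append(my_dict)
    # grows by roughly 200000 tuples (plus one dict) per iteration
    print i, len(gc.get_objects())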

Regards, Bosko
 
Nick Craig-Wood

[Bosko's analysis quoted; snipped]

That agrees with my analysis also - it is all about memory usage due
to the duplicated (key, key) tuples. In fact it puts my machine into
swap at the larger iteration numbers!

Here is a version that stores the tuples in a cache instead, which
runs nice and fast:

import time

list_of_dicts = []
keys = [str(x) for x in range(200000)]
tuple_cache = {}
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        # reuse a single shared tuple per key instead of keeping
        # 200000 duplicate tuples alive per dictionary
        value = tuple_cache.setdefault((key, key), (key, key))
        my_dict[key] = value
    list_of_dicts.append(my_dict)
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk

0 1.0
1 0.39
2 0.41
3 0.39
4 0.39
5 0.4
6 0.41
7 0.39
8 0.4
9 0.4
10 0.39
11 0.4
12 0.4
13 0.39
14 0.4
15 0.4
16 0.39
17 0.4
18 0.39
19 0.41

Note that the first iteration is slower, as it builds the tuple cache.
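
A possible refinement: setdefault((key, key), (key, key)) still builds
two short-lived tuples on every call, even when the cached one is
reused. Keying the cache on the string itself avoids building any
tuple once the cache is warm (a sketch along those lines, not timed):

import time

list_of_dicts = []
keys = [str(x) for x in range(200000)]
tuple_cache = {}
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        value = tuple_cache.get(key)
        if value is None:
            value = (key, key)       # built only once per distinct key
            tuple_cache[key] = value
        my_dict[key] = value
    list_of_dicts.append(my_dict)
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk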
 
Maciej Bliziński

Thank you for all the answers! My problem is solved even better than I
expected!

@Peter: Yes, the garbage collector was causing the slowdown. Switching
it off sped the program up, and each iteration now takes the same
amount of time. I run the collection manually every 10 iterations to
keep memory usage under control.
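
Roughly like this (a sketch; the every-10-iterations interval is just
what happened to work for me):

import gc
import time

gc.disable()                        # no automatic cyclic collections

list_of_dicts = []
keys = [str(x) for x in range(200000)]
prev_clk = time.clock()
for i in range(20):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)
    list_of_dicts.append(my_dict)
    if i % 10 == 9:
        gc.collect()                # reclaim garbage at controlled points
    new_clk = time.clock()
    print i, new_clk - prev_clk
    prev_clk = new_clk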

@Bosko: Right, the example code had a bug; the list append was meant
to be one nesting level up.

@Nick: Using a tuple cache is great advice! My real program's tuples
are even better suited to caching, as only about 10 distinct tuples
are used as values. The tuple cache reduced memory consumption by a
factor of about 3-4 (judging by the gkrellm display).

Thanks!
Maciej
 
