Oddity with large dictionary (several million entries)


Lothar Werzinger

Hi,

I am trying to load files into a dictionary for analysis. The size of the
dictionary will grow quite large (several million entries), and as inserting
into a dictionary is roughly O(n), I figured that if I loaded each file into
its own dictionary it would speed things up. However, it did not.

So I decided to write a small test program (attached).

As you can see, I am inserting one million entries at a time into a map. I ran
one test where I put all three million entries into one map and one where I
put one million each into its own map.

What I would have expected is that if I insert each batch of one million
entries into its own map, the time to do that would be roughly constant for
each map. Interestingly, it is not; it's about the same as if I load everything
into one map.

Oh, and I have 4 GB of RAM and the test consumes about 40% at its peak. I even
ran the test on one of our servers with 64 GB of RAM, so I can rule out
swapping as the issue.

Can anyone explain this oddity? Any insight is highly appreciated.

Here's the output of the test runs:

$ ./dicttest.py -t 0
Inserting into one map
Inserting 1000000 keys lasted 0:00:26 (38019 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:01:17 (12831 1/s)
len(map) 2000000
Inserting 1000000 keys lasted 0:02:23 (6972 1/s)
len(map) 3000000
total 3000000

$ ./dicttest.py -t 1
Inserting into three maps
Inserting 1000000 keys lasted 0:00:32 (30726 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:01:29 (11181 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:02:23 (6957 1/s)
len(map) 1000000
total 3000000


Thanks,
Lothar

,----[ /home/lothar/tmp/dicttest.py ]
| #!/usr/bin/python
| # -*- coding: utf-8 -*-
|
| import datetime
| import optparse
| import sys
| import time
|
|
| def fillDict(map, nop, num, guid):
|     # Time the insertion of `num` keys of the form (i, guid) into `map`.
|     # With nop=True the loop runs without storing anything, so only the
|     # loop overhead is measured.
|     before = time.time()
|
|     for i in xrange(0, num):
|         key = (i, guid)
|         if not nop:
|             map[key] = ([], {})
|
|     after = time.time()
|     elapsed = (after - before)
|     if elapsed <= 0:
|         divide = 1.0
|     else:
|         divide = elapsed
|     elapsedTime = datetime.timedelta(seconds=int(elapsed))
|     print("Inserting %d keys lasted %s (%d 1/s)" % (num, elapsedTime, (num / divide)))
|     print("len(map) %d" % (len(map)))
|
|
| def test0(nop, num):
|     print("Inserting into one map")
|     map = {}
|     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd71")
|     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd72")
|     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd73")
|     print("total %d" % (len(map)))
|
|
| def test1(nop, num):
|     print("Inserting into three maps")
|     map1 = {}
|     map2 = {}
|     map3 = {}
|     fillDict(map1, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd71")
|     fillDict(map2, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd72")
|     fillDict(map3, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd73")
|     total = 0
|     for map in [map1, map2, map3]:
|         total += len(map)
|     print("total %d" % (total))
|
|
| if __name__ == "__main__":
|     usage = "USAGE: %prog [options]"
|     description = "test"
|     version = "%prog 1.0"
|
|     parser = optparse.OptionParser(usage=usage, version=version,
|                                    description=description)
|     parser.add_option(
|         "-t",
|         "--testnum",
|         action="store",
|         dest="testnum",
|         help="the number of the test to execute",
|         type="int",
|         default=1
|     )
|     parser.add_option(
|         "-i",
|         "--iterations",
|         action="store",
|         dest="iterations",
|         help="the number of iterations",
|         type="int",
|         default=1000000
|     )
|     parser.add_option(
|         "-n",
|         "--nop",
|         action="store_true",
|         dest="nop",
|         help="don't store in the dictionary, only load and parse",
|         default=False
|     )
|
|     (options, args) = parser.parse_args()
|
|     testmap = {
|         0: test0,
|         1: test1,
|     }
|
|     test = testmap[options.testnum]
|
|     test(options.nop, options.iterations)
`----
 

Peter Otten

Lothar said:
I am trying to load files into a dictionary for analysis. The size of the
dictionary will grow quite large (several million entries), and as
inserting into a dictionary is roughly O(n), I figured that if I loaded
each file into its own dictionary it would speed things up. However, it
did not.

So I decided to write a small test program (attached)

As you can see, I am inserting one million entries at a time into a map. I ran
one test where I put all three million entries into one map and one where
I put one million each into its own map.

What I would have expected is that if I insert each batch of one million
entries into its own map, the time to do that would be roughly constant for
each map. Interestingly, it is not; it's about the same as if I load
everything into one map.

Oh, and I have 4 GB of RAM and the test consumes about 40% at its peak. I
even ran the test on one of our servers with 64 GB of RAM, so I can rule out
swapping as the issue.

Can anyone explain this oddity? Any insight is highly appreciated.

When you are creating objects like there is no tomorrow, Python's cyclic
garbage collection often takes a significant amount of time. The first
thing I'd try is therefore switching it off with:

import gc
gc.disable()
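
If you don't want to leave the collector switched off for the rest of the
program, one option is to disable it only around the bulk load. This is just
a sketch (the gc_paused helper name is made up), but it uses only the
standard gc and contextlib modules:

import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    # Temporarily disable the cyclic garbage collector and restore
    # its previous state afterwards.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# e.g. wrap the bulk insert from dicttest.py:
# with gc_paused():
#     fillDict(map, nop, num, guid)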

Peter
 

MRAB

Lothar said:
Hi,

I am trying to load files into a dictionary for analysis. The size of the
dictionary will grow quite large (several million entries), and as inserting
into a dictionary is roughly O(n), I figured that if I loaded each file into
its own dictionary it would speed things up. However, it did not.

So I decided to write a small test program (attached)

As you can see, I am inserting one million entries at a time into a map. I ran
one test where I put all three million entries into one map and one where I
put one million each into its own map.

What I would have expected is that if I insert each batch of one million
entries into its own map, the time to do that would be roughly constant for
each map. Interestingly, it is not; it's about the same as if I load
everything into one map.

Oh, and I have 4 GB of RAM and the test consumes about 40% at its peak. I even
ran the test on one of our servers with 64 GB of RAM, so I can rule out
swapping as the issue.

Can anyone explain this oddity? Any insight is highly appreciated.

Here's the output of the test runs:

$ ./dicttest.py -t 0
Inserting into one map
Inserting 1000000 keys lasted 0:00:26 (38019 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:01:17 (12831 1/s)
len(map) 2000000
Inserting 1000000 keys lasted 0:02:23 (6972 1/s)
len(map) 3000000
total 3000000

$ ./dicttest.py -t 1
Inserting into three maps
Inserting 1000000 keys lasted 0:00:32 (30726 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:01:29 (11181 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:02:23 (6957 1/s)
len(map) 1000000
total 3000000
[snip]
Inserting into a Python dict is amortized O(1).
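
For reference, a rough way to see that the per-insert cost stays flat as the
dict grows is to time equal-sized batches. This is only a sketch, run with the
cyclic collector disabled as suggested above, and it uses plain int keys
rather than the tuple keys from dicttest.py:

import gc
import time

gc.disable()  # keep the cyclic collector out of the measurement

d = {}
batch = 1000000
for n in range(3):
    start = time.time()
    for i in range(n * batch, (n + 1) * batch):
        d[i] = None  # plain int keys, no container values
    print("batch %d: %.2f s, len(d)=%d" % (n, time.time() - start, len(d)))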
 

Lothar Werzinger

Peter said:
When you are creating objects like there is no tomorrow, Python's cyclic
garbage collection often takes a significant amount of time. The first
thing I'd try is therefore switching it off with:

import gc
gc.disable()

Peter

Wow, that really MAKES a difference! Thanks a lot!

$ ~/tmp/dicttest.py -t 0
Inserting into one map
Inserting 1000000 keys lasted 0:00:01 (960152 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:00:01 (846416 1/s)
len(map) 2000000
Inserting 1000000 keys lasted 0:00:04 (235241 1/s)
len(map) 3000000
total 3000000

$ ~/tmp/dicttest.py -t 1
Inserting into three maps
Inserting 1000000 keys lasted 0:00:01 (973344 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:00:00 (1011303 1/s)
len(map) 1000000
Inserting 1000000 keys lasted 0:00:00 (1021796 1/s)
len(map) 1000000
total 3000000


Thanks!
 

Benjamin Peterson

Lothar Werzinger said:
Wow, that really MAKES a difference! Thanks a lot!

You should also try Python 2.7 or 3.1. We've recently optimized the garbage
collector for situations where many objects are being created.
 
