Oddity with large dictionary (several million entries)

Discussion in 'Python' started by Lothar Werzinger, Apr 27, 2010.

  1. Lothar Werzinger

    Hi,

    I am trying to load files into a dictionary for analysis. The size of the
    dictionary will grow quite large (several million entries), and as
    inserting into a dictionary is roughly O(n), I figured that if I loaded
    each file into its own dictionary it would speed things up. However, it
    did not.

    So I decided to write a small test program (attached below).

    As you can see, I am inserting one million entries at a time into a map. I
    ran one test where I put all three million entries into one map, and one
    where I put each batch of one million into its own map.

    What I would have expected is that inserting one million entries into a
    fresh map would take roughly the same time for each map. Interestingly, it
    does not: each batch takes about as long as if I load everything into one
    map.

    Oh, and I have 4 GB of RAM and the test consumes about 40% at its peak. I
    even ran the test on one of our servers with 64 GB of RAM, so I can rule
    out swapping as the issue.

    Can anyone explain this oddity? Any insight is highly appreciated.

    Here's the output of the test runs:

    $ ./dicttest.py -t 0
    Inserting into one map
    Inserting 1000000 keys lasted 0:00:26 (38019 1/s)
    len(map) 1000000
    Inserting 1000000 keys lasted 0:01:17 (12831 1/s)
    len(map) 2000000
    Inserting 1000000 keys lasted 0:02:23 (6972 1/s)
    len(map) 3000000
    total 3000000

    $ ./dicttest.py -t 1
    Inserting into three maps
    Inserting 1000000 keys lasted 0:00:32 (30726 1/s)
    len(map) 1000000
    Inserting 1000000 keys lasted 0:01:29 (11181 1/s)
    len(map) 1000000
    Inserting 1000000 keys lasted 0:02:23 (6957 1/s)
    len(map) 1000000
    total 3000000


    Thanks,
    Lothar

    ,----[ /home/lothar/tmp/dicttest.py ]
    | #!/usr/bin/python
    | # -*- coding: utf-8 -*-
    |
    | import datetime
    | import optparse
    | import sys
    | import time
    |
    |
    | def fillDict(map, nop, num, guid):
    |     # Insert `num` (key, value) pairs and report the elapsed time
    |     # and the insertion rate.
    |     before = time.time()
    |
    |     for i in xrange(0, num):
    |         key = (i, guid)
    |         if not nop:
    |             map[key] = ([], {})
    |
    |     after = time.time()
    |     elapsed = (after - before)
    |     if elapsed <= 0:
    |         divide = 1.0
    |     else:
    |         divide = elapsed
    |     elapsedTime = datetime.timedelta(seconds=int(elapsed))
    |     print("Inserting %d keys lasted %s (%d 1/s)"
    |           % (num, elapsedTime, (num / divide)))
    |     print("len(map) %d" % (len(map)))
    |
    |
    | def test0(nop, num):
    |     # Three batches into a single shared dictionary.
    |     print("Inserting into one map")
    |     map = {}
    |     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd71")
    |     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd72")
    |     fillDict(map, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd73")
    |     print("total %d" % (len(map)))
    |
    |
    | def test1(nop, num):
    |     # Three batches, one dictionary per batch.
    |     print("Inserting into three maps")
    |     map1 = {}
    |     map2 = {}
    |     map3 = {}
    |     fillDict(map1, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd71")
    |     fillDict(map2, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd72")
    |     fillDict(map3, nop, num, "0561c83c-9675-4e6f-bedc-86bcb6acfd73")
    |     total = 0
    |     for map in [map1, map2, map3]:
    |         total += len(map)
    |     print("total %d" % (total))
    |
    |
    | if __name__ == "__main__":
    |     usage = "USAGE: %prog [options]"
    |     description = "test"
    |     version = "%prog 1.0"
    |
    |     parser = optparse.OptionParser(usage=usage, version=version,
    |                                    description=description)
    |     parser.add_option(
    |         "-t",
    |         "--testnum",
    |         action="store",
    |         dest="testnum",
    |         help="the number of the test to execute",
    |         type="int",
    |         default=1
    |     )
    |     parser.add_option(
    |         "-i",
    |         "--iterations",
    |         action="store",
    |         dest="iterations",
    |         help="the number of iterations",
    |         type="int",
    |         default=1000000
    |     )
    |     parser.add_option(
    |         "-n",
    |         "--nop",
    |         action="store_true",
    |         dest="nop",
    |         help="don't store in the dictionary, only load and parse",
    |         default=False
    |     )
    |
    |     (options, args) = parser.parse_args()
    |
    |     testmap = {
    |         0: test0,
    |         1: test1,
    |     }
    |
    |     test = testmap[options.testnum]
    |
    |     test(options.nop, options.iterations)
    `----
     
    Lothar Werzinger, Apr 27, 2010
    #1

  2. Peter Otten

    Lothar Werzinger wrote:

    > I am trying to load files into a dictionary for analysis. The size of the
    > dictionary will grow quite large (several million entries), and as
    > inserting into a dictionary is roughly O(n), I figured that if I loaded
    > each file into its own dictionary it would speed things up. However, it
    > did not.
    >
    > [snip]
    >
    > Can anyone explain this oddity? Any insight is highly appreciated.


    When you are creating objects like there is no tomorrow, Python's cyclic
    garbage collection often takes a significant amount of time. The first
    thing I'd try is therefore switching it off with:

    import gc
    gc.disable()
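
    If the collector should come back on after the load, the disable can be
    scoped to just the bulk insert. A minimal sketch (not from the original
    post; gc_paused is a made-up name):

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_paused():
        # Pause the cyclic collector for the duration of a block,
        # then restore it and run one collection at the end.
        gc.disable()
        try:
            yield
        finally:
            gc.enable()
            gc.collect()

    # Hypothetical usage with fillDict() from the test program:
    # with gc_paused():
    #     fillDict(map, False, 1000000, guid)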

    Peter
     
    Peter Otten, Apr 27, 2010
    #2

  3. MRAB

    Lothar Werzinger wrote:
    > I am trying to load files into a dictionary for analysis. The size of the
    > dictionary will grow quite large (several million entries), and as
    > inserting into a dictionary is roughly O(n), I figured that if I loaded
    > each file into its own dictionary it would speed things up. However, it
    > did not.
    >
    > [snip]

    Inserting into a Python dict is amortized O(1), not O(n).
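
    Once collector pauses are out of the picture you can see the amortized
    O(1) behaviour directly. A rough sketch (Python 2, matching the thread)
    that times equal-sized batches as one dict grows:

    import gc
    import time

    gc.disable()  # keep collector pauses out of the measurement
    d = {}
    for _ in xrange(5):
        base = len(d)
        start = time.time()
        for i in xrange(100000):
            d[(base + i, "guid")] = ([], {})
        print("size %d: %.2fs per 100000 inserts"
              % (len(d), time.time() - start))

    The occasional table resize is spread across many inserts, so the
    per-batch time stays roughly flat instead of growing with len(d).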
     
    MRAB, Apr 27, 2010
    #3
  4. Lothar Werzinger

    Peter Otten wrote:

    > Lothar Werzinger wrote:
    >> Can anyone explain this oddity? Any insight is highly appreciated.

    >
    > When you are creating objects like there is no tomorrow, Python's cyclic
    > garbage collection often takes a significant amount of time. The first
    > thing I'd try is therefore switching it off with:
    >
    > import gc
    > gc.disable()
    >
    > Peter


    Wow, that really MAKES a difference! Thanks a lot!

    $ ~/tmp/dicttest.py -t 0
    Inserting into one map
    Inserting 1000000 keys lasted 0:00:01 (960152 1/s)
    len(map) 1000000
    Inserting 1000000 keys lasted 0:00:01 (846416 1/s)
    len(map) 2000000
    Inserting 1000000 keys lasted 0:00:04 (235241 1/s)
    len(map) 3000000
    total 3000000

    $ ~/tmp/dicttest.py -t 1
    Inserting into three maps
    Inserting 1000000 keys lasted 0:00:01 (973344 1/s)
    len(map) 1000000
    Inserting 1000000 keys lasted 0:00:00 (1011303 1/s)
    len(map) 1000000
    Inserting 1000000 keys lasted 0:00:00 (1021796 1/s)
    len(map) 1000000
    total 3000000


    Thanks!
    --
    Lothar
     
    Lothar Werzinger, Apr 27, 2010
    #4
  5. Benjamin Peterson

    Lothar Werzinger <lothar <at> tradescape.biz> writes:
    > Wow, that really MAKES a difference! Thanks a lot!


    You should also try Python 2.7 or 3.1. We've recently optimized the garbage
    collector for situations where many objects are being created.
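
    For code that cannot simply turn the collector off, raising the
    generation-0 threshold is a middle ground: bulk allocation of containers
    then triggers far fewer collections. A sketch (the default thresholds are
    an implementation detail and may differ between versions):

    import gc

    print(gc.get_threshold())         # typically (700, 10, 10)
    gc.set_threshold(100000, 10, 10)  # run gen-0 collection far less often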
     
    Benjamin Peterson, Apr 28, 2010
    #5
