Python dictionary size/entry limit?

Discussion in 'Python' started by intelliminer, Feb 21, 2009.

  1. intelliminer

    intelliminer Guest

    I wrote a script to process textual data and extract phrases from
    them, storing these phrases in a dictionary. It encounters a
    MemoryError when there are about 11.18M keys in the dictionary, and
    the size is about 1.5GB. I tried multiple times, and the error occurs
    every time at exactly the same place (with the same number of keys in
    the dict). I then split the dictionary into two using a simple test on
    the first character of each phrase:

    if phrase[0] <= 'm':
        dict_a[phrase] = value   # first half of the key space
    else:
        dict_b[phrase] = value   # second half

    And it worked fine. The total size of the two dictionaries well
    exceeded 2GB, yet no MemoryError occurred.

    I have 1GB of physical memory and 3GB in the pagefile. Is there a limit
    to the size or number of entries that a single dictionary can have? By
    searching the web I can't find a clue as to why this problem occurs.
    intelliminer, Feb 21, 2009

  2. intelliminer

    intelliminer Guest

    Yes, it's winxp, I forgot to mention it.
    Thanks for the reply.
    intelliminer, Feb 21, 2009

  3. Python dicts are only limited by what your OS returns as free memory.
    However, when a dict grows, it needs to resize, which means that it has to
    allocate a bigger copy of itself and redistribute the keys. For a dict
    that is already 1.5GB in size, this can temporarily require a lot more
    memory than you have: at least twice the size of the dict itself.
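    The resize behaviour described above can be observed directly: CPython
    grows a dict's internal table in jumps, which sys.getsizeof makes
    visible. A minimal sketch (exact sizes and thresholds vary across
    Python versions):

    ```python
    import sys

    d = {}
    last_size = sys.getsizeof(d)
    print(f"empty dict: {last_size} bytes")

    for i in range(100_000):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last_size:
            # A resize just happened: a larger table was allocated and
            # all existing keys were rehashed into it.
            print(f"after {len(d):7d} keys: {size} bytes")
            last_size = size
    ```

    Each printed line marks a reallocation; at 1.5GB, that reallocation
    briefly needs the old and new tables in memory at once.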

    You may be better served with one of the dbm databases that come with
    Python. They live on-disk but do the usual in-memory caching. They'll
    likely perform a lot better than your OS level swap file.

    Stefan Behnel, Feb 21, 2009
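    A minimal sketch of the dbm route suggested above, using dbm.dumb
    (always available; dbm.gnu or dbm.ndbm are faster where installed).
    The file name here is arbitrary; note that dbm stores bytes, not
    arbitrary objects, so counts must be serialized:

    ```python
    import dbm.dumb
    import os, tempfile

    path = os.path.join(tempfile.mkdtemp(), "phrases")

    # "c" opens the database, creating it if it does not exist
    db = dbm.dumb.open(path, "c")

    # Dict-style writes; keys and values are bytes
    db[b"hello world"] = b"3"
    db[b"memory error"] = b"1"

    print(db[b"hello world"])        # b'3'
    print(b"memory error" in db)     # True
    print(len(db.keys()))            # 2

    db.close()
    ```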
  4. Sean

    Sean Guest

    The bsddb module has the feature that access to the database uses the
    same methods as Python dictionaries.

    Sean, Feb 22, 2009
  5. intelliminer

    intelliminer Guest

    Ummm, I didn't know about the dbm databases. It seems there are many
    modules for this kind of task: gdbm, Berkeley DB, cdb, etc. I need to
    implement a constant (read-only) hashtable with a large number of keys,
    but only a small fraction of them will be accessed frequently, and read
    speed is crucial. It would be ideal if the implementation cached all
    the frequently used key/value pairs in memory. Which module should I
    use? And is there a way to specify the amount of memory it uses for
    caching?
    BTW, the target platform is Linux.

    Thank you.
    intelliminer, Feb 22, 2009
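    One way to get the caching behaviour asked about here is to keep the
    table on disk and wrap the lookup in functools.lru_cache. Note that
    maxsize bounds the number of cached entries, not bytes, so the memory
    budget is only indirect. A sketch under those assumptions, with
    placeholder file name and data:

    ```python
    import dbm.dumb
    import functools, os, tempfile

    path = os.path.join(tempfile.mkdtemp(), "table")

    # Build the constant table once, then reopen it read-only
    db = dbm.dumb.open(path, "c")
    db[b"the quick"] = b"42"
    db.close()
    db = dbm.dumb.open(path, "r")

    @functools.lru_cache(maxsize=100_000)  # hot entries stay in RAM
    def lookup(phrase: bytes) -> bytes:
        return db[phrase]

    print(lookup(b"the quick"))          # b'42' (read from disk)
    print(lookup(b"the quick"))          # b'42' (served from the cache)
    print(lookup.cache_info().hits)      # 1
    ```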
  6. Their interfaces are mostly compatible with Python dicts. Just keep your
    code backend-independent at the beginning and benchmark it against all of
    them.

    Stefan Behnel, Feb 22, 2009
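    The benchmarking advice above can be followed by coding against the
    common dbm interface and timing each backend. A rough sketch (only
    dbm.dumb is guaranteed to exist; gdbm/ndbm availability varies by
    platform):

    ```python
    import os, tempfile, time

    def bench(backend, n=1000):
        """Time n dict-style writes followed by n reads on one dbm backend."""
        path = os.path.join(tempfile.mkdtemp(), "bench")
        db = backend.open(path, "c")
        start = time.perf_counter()
        for i in range(n):
            db[str(i).encode()] = b"x"
        for i in range(n):
            _ = db[str(i).encode()]
        db.close()
        return time.perf_counter() - start

    import dbm.dumb
    print(f"dbm.dumb: {bench(dbm.dumb):.3f}s")
    # Where available, dbm.gnu and dbm.ndbm can be passed to bench() unchanged
    ```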
  7. On a 32-bit system, the dictionary can have up to 2**31 slots,
    meaning that the maximum number of keys is slightly smaller
    (about 2**30).

    As others have pointed out, Python's or Windows' memory management
    might have led to fragmentation, so that no contiguous piece of
    memory to hold the resized dictionary can be found.

    One solution to that problem would be to use a 64-bit system.

    Martin v. Löwis, Feb 22, 2009
  8. Which, in practice, means that the size is limited by the available memory.

    Stefan Behnel, Feb 25, 2009
  9. > On a 32-bit system, the dictionary can have up to 2**31 slots,

    Right. Each slot takes 12 bytes, so the storage for the slots alone
    would consume all of the available address space.

    From that point of view, you can't possibly have more than about 358M
    slots (2**32 / 12, on the order of 2**28) in a 32-bit address space.

    Martin v. Löwis, Feb 25, 2009
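    The arithmetic behind that limit in a few lines: with 12-byte slots,
    the 32-bit address space caps the table size well below the 2**31-slot
    theoretical limit.

    ```python
    slot_bytes = 12           # on 32-bit CPython: 4-byte hash + key ptr + value ptr
    address_space = 2 ** 32   # bytes addressable with 32-bit pointers
    max_slots = address_space // slot_bytes
    print(max_slots)          # 357913941, i.e. roughly 358M slots
    ```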
