Initializing the number of slots in a dictionary

Discussion in 'Python' started by Jon Smirl, Aug 6, 2006.

  1. Jon Smirl

    Jon Smirl Guest

    Is there some way to tell a dictionary object that I am going to load 1M
    objects into it and have it pre-allocate enough slots to hold all of the
    entries? Thus avoiding many thousands of memory allocations.

    Jon Smirl
     
    Jon Smirl, Aug 6, 2006
    #1

  2. John Machin

    John Machin Guest

    Jon Smirl wrote:
    > Is there some way to tell a dictionary object that I am going to load 1M
    > objects into it and have it pre-allocate enough slots to hold all of the
    > entries?


    Not according to the manual.

    Not according to the source [as at 2.4.3]. In any case, if there were a
    back-door undocumented arg for the dict constructor, somebody would
    have read the source and spread the news.

    > Thus avoiding many thousands of memory allocations.


    What gives you the impression that "many thousands of memory allocations"
    are involved? [My impression is that the dict would be resized about 11
    times on its trip from size 8 to size 2M, and each resize would involve
    one allocation.]
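
    To make that estimate concrete, here is a rough model of the 2.4
    resize rule (grow when the table is two-thirds full, to a power of
    two at least 4*used, or 2*used once the dict holds more than 50000
    entries); the constants are from dictobject.c, and deletions are
    ignored:

    size, used, resizes = 8, 0, 0
    for _ in range(1000000):
        used += 1
        if used * 3 >= size * 2:          # table has become 2/3 full
            target = used * (2 if used > 50000 else 4)
            while size < target:
                size *= 2
            resizes += 1
    print(resizes, size)                  # -> 11 resizes, 2097152 slots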

    Do you have an application with a performance problem? If so, what
    makes you think inserting 1M items into a Python dict is contributing
    to the problem?

    Have you read the thread "Large Dictionaries" which started on/about
    2006-05-15?

    In general, what is the background to your question?

    Cheers,
    John
     
    John Machin, Aug 6, 2006
    #2

  3. Jon Smirl

    Jon Smirl Guest

    On Sun, 06 Aug 2006 15:33:30 -0700, John Machin wrote:

    > Jon Smirl wrote:
    >> Is there some way to tell a dictionary object that I am going to load 1M
    >> objects into it and have it pre-allocate enough slots to hold all of
    >> the entries?

    >
    > Not according to the manual.
    >
    > Not according to the source [as at 2.4.3]. In any case, if there were a
    > back-door undocumented arg for the dict constructor, somebody would have
    > read the source and spread the news.
    >
    >> Thus avoiding many thousands of memory allocations.

    >
    > What gives you the impression that "many thousands of memory allocations"
    > are involved? [My impression is that the dict would be resized about 11
    > times on its trip from size 8 to size 2M, and each resize would involve
    > one allocation.]
    >
    > Do you have an application with a performance problem? If so, what makes
    > you think inserting 1M items into a Python dict is contributing to the
    > problem?


    I know in advance how many items will be added to the dictionary. Most
    dictionary implementations I have previously worked with are more
    efficient if they know ahead of time how big to make their tables.

    In this case I only need to do a presence test of the key; there is no
    actual data associated with the key. The table is used for detecting
    duplicate entries. Is there a more efficient way to do this test than
    sticking an empty string into a dict? The keys are sha1.digest().
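
    For concreteness, a minimal sketch of the pattern I mean (hashlib is
    the modern spelling of the old sha module, and is_duplicate is just
    an illustrative name):

    import hashlib

    seen = {}

    def is_duplicate(data):
        key = hashlib.sha1(data).digest()
        if key in seen:
            return True
        seen[key] = ""    # the value is never used; only the key matters
        return False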

    >
    > Have you read the thread "Large Dictionaries" which started on/about
    > 2006-05-15?


    I'll look at this.

    > In general, what is the background to your question?


    I am in the process of modifying cvs2svn to do cvs2git. The purpose of
    this is to convert Mozilla CVS (4GB) into git format. Currently a
    conversion takes 5 hours but the code isn't finished yet.

    Other programs that attempt to process the Mozilla CVS repository take
    several days to run, but none of them can successfully convert the
    Mozilla repository without encountering errors. cvs2svn can successfully
    convert the repository to Subversion, but it takes about a week. It is
    obvious that a lot of effort has been put into teaching cvs2svn how to
    deal with broken CVS repositories.

    I do practice on smaller repositories, but some of the problems I am
    dealing with only occur when processing the full dataset (it contains
    weird CVS errors). In general the conversion process is completely IO
    bound. Since I am rerunning the conversion over and over, anything simple
    that speeds it up is helpful. I already have 4GB RAM; it would take around
    30GB to get everything in memory.

    Dictionaries are not a big problem for me, but there are many in use and
    they have millions of items in them. But first I have to get the code for
    converting to git working; then I can worry about how long it runs. Two or
    three days is ok; a month is not.

    Jon Smirl
     
    Jon Smirl, Aug 7, 2006
    #3
  4. Tim Peters

    Tim Peters Guest

    ....

    [Jon Smirl]
    > I know in advance how many items will be added to the dictionary. Most
    > dictionary implementations I have previously worked with are more
    > efficient if they know ahead of time how big to make their tables.


    Richard Jones spent considerable time investigating whether
    "pre-sizing" lists and dicts in CPython could help, at the "Need For
    Speed" sprint earlier this year. He didn't find a win worth getting;
    e.g., read the section "List pre-allocation" at:

    http://wiki.python.org/moin/NeedForSpeed/Failures

    Trying it for dicts was also part of what he did, but I don't think
    specifics about that were recorded on the Wiki. I was at the sprint,
    and hoots of triumph from Richard's direction were conspicuous by their
    absence during his dict time ;-)

    > In this case I only need to do a presence test of the key; there is no
    > actual data associated with the key. The table is used for detecting
    > duplicate entries. Is there a more efficient way to do this test than
    > sticking an empty string into a dict? The keys are sha1.digest().


    It's probably more common to set the value to None or 1, but it
    doesn't really matter in reality. In theory, using 1 or an empty
    string relies on the implementation accident that CPython stores those
    uniquely, while it's guaranteed that None is a singleton object.
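
    A quick illustration of that point (the "" case is CPython-specific):

    d = {"k1": None, "k2": None}
    assert d["k1"] is d["k2"]    # guaranteed: None is a singleton
    e = {"k1": "", "k2": ""}
    assert e["k1"] is e["k2"]    # holds in CPython, but only as an
                                 # implementation accident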

    BTW, is there a reason to use SHA instead of MD5? I ask because the
    latter is 4 bytes shorter, and you apparently have a /lot/ of keys.
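
    The size difference, via the modern hashlib API:

    import hashlib
    print(hashlib.md5().digest_size)     # 16 bytes
    print(hashlib.sha1().digest_size)    # 20 bytes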

    If you're using a current version of Python, you can save some memory
    by using a builtin set object instead. The CPython implementation of
    sets is very much like its implementation of dicts, but doesn't
    consume any memory for the (non-existent) associated values. You also
    get a number of operations useful on sets (like intersection and
    union).
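
    The duplicate test sketched earlier in the thread, redone with a set
    along these lines:

    import hashlib

    seen = set()

    def is_duplicate(data):
        key = hashlib.sha1(data).digest()
        if key in seen:
            return True
        seen.add(key)
        return False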

    > ...
    > Since I am rerunning the conversion over and over, anything simple that
    > speeds it up is helpful.
    >
    > I already have 4GB RAM; it would take around 30GB to get everything in
    > memory.
    >
    > Dictionaries are not a big problem for me, but there are many in use and
    > they have millions of items in them.


    A peculiar suggestion: if you don't need to release system resources
    cleanly at the end of a run, try doing:

    import os
    os._exit(0)

    at the end. If you have dicts with millions of entries swapped to
    disk, it /can/ consume major time just to page them all in again to
    decrement the key & value refcounts if you let "clean shutdown" code
    determine they've become trash. Bailing ungracefully can skip all
    that work (but also skips other work, like letting the platform C I/O
    library close still-open files gracefully).
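
    A minimal self-contained sketch of that shutdown, with the caveat
    made explicit (results.txt is a stand-in for whatever output the run
    actually produces):

    import os
    import sys

    output_file = open("results.txt", "w")
    output_file.write("conversion finished\n")

    output_file.flush()    # os._exit() will not flush buffered I/O
    sys.stdout.flush()
    os._exit(0)            # exit now: no GC, no refcount teardown of
                           # the millions of dict entries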
     
    Tim Peters, Aug 7, 2006
    #4
  5. Jon Smirl

    Jon Smirl Guest

    On Mon, 07 Aug 2006 00:33:33 -0400, Tim Peters wrote:

    > ...
    >
    > [Jon Smirl]
    >> I know in advance how many items will be added to the dictionary. Most
    >> dictionary implementations I have previously worked with are more
    >> efficient if they know ahead of time how big to make their tables.

    >
    > Richard Jones spent considerable time investigating whether "pre-sizing"
    > lists and dicts in CPython could help, at the "Need For Speed" sprint
    > earlier this year. He didn't find a win worth getting; e.g., read the
    > section "List pre-allocation" at:
    >
    > http://wiki.python.org/moin/NeedForSpeed/Failures
    >
    > Trying it for dicts was also part of what he did, but I don't think
    > specifics about that were recorded on the Wiki. I was at the sprint, and
    > hoots of triumph from Richard's direction were conspicuous by their
    > absence during his dict time ;-)
    >
    >> In this case I only need to do a presence test of the key; there is no
    >> actual data associated with the key. The table is used for detecting
    >> duplicate entries. Is there a more efficient way to do this test than
    >> sticking an empty string into a dict? The keys are sha1.digest().

    >
    > It's probably more common to set the value to None or 1, but it doesn't
    > really matter in reality. In theory, using 1 or an empty string relies on
    > the implementation accident that CPython stores those uniquely, while it's
    > guaranteed that None is a singleton object.
    >
    > BTW, is there a reason to use SHA instead of MD5? I ask because the
    > latter is 4 bytes shorter, and you apparently have a /lot/ of keys.


    http://git.or.cz/index.html
    git is the source code control tool used by the Linux kernel. It is
    optimized for distributed development. git is what the kernel developers
    wrote to replace Bitkeeper after the Bitkeeper license change.

    git identifies its change sets with sha1 hashes.

    Distributed SCM is very important when working on large projects. With a
    distributed SCM nobody needs commit privs - everyone has their own
    complete copy of the repository. To get a change into a distribution you
    need to convince someone up the hierarchy to pull your changes into
    their repository. In the Linux world Linus makes the distributions. There
    are about 10 people in the next circle and about 100 in the circle after
    that. You need to convince one of them to accept your changes, but that's
    not very hard if your code makes sense.

    Distributed also means that you can pull changes from anyone else into
    your repository. So if you want to run Reiser4 and it isn't in the main
    Linus tree yet, just pull a copy from the namesys git tree into your local
    tree and everything will get merged.

    Python is using Subversion, which is way better than CVS, but Subversion
    is still organized around a central repository with commit privs. If that
    repository disappears in an earthquake, Python may be in trouble.

    If you give git a try you will also notice that it is way faster than
    Subversion.

    > If you're using a current version of Python, you can save some memory by
    > using a builtin set object instead. The CPython implementation of sets is
    > very much like its implementation of dicts, but doesn't consume any memory
    > for the (non-existent) associated values. You also get a number of
    > operations useful on sets (like intersection and union).


    Sets may be a good option, I'll adjust the code for the next run.

    >> ...
    >> Since I am rerunning the conversion over and over, anything simple that
    >> speeds it up is helpful.
    >>
    >> I already have 4GB RAM; it would take around 30GB to get everything in
    >> memory.
    >>
    >> Dictionaries are not a big problem for me, but there are many in use and
    >> they have millions of items in them.

    >
    > A peculiar suggestion: if you don't need to release system resources
    > cleanly at the end of a run, try doing:
    >
    > import os
    > os._exit(0)
    >
    > at the end. If you have dicts with millions of entries swapped to disk,
    > it /can/ consume major time just to page them all in again to decrement
    > the key & value refcounts if you let "clean shutdown" code determine
    > they've become trash. Bailing ungracefully can skip all that work (but
    > also skips other work, like letting the platform C I/O library close
    > still-open files gracefully).


    Jon Smirl
     
    Jon Smirl, Aug 7, 2006
    #5
  6. Fuzzyman

    Fuzzyman Guest

    Jon Smirl wrote:
    > On Sun, 06 Aug 2006 15:33:30 -0700, John Machin wrote:

    [snip..]
    > >
    > > Do you have an application with a performance problem? If so, what makes
    > > you think inserting 1M items into a Python dict is contributing to the
    > > problem?

    >
    > I know in advance how many items will be added to the dictionary. Most
    > dictionary implementations I have previously worked with are more
    > efficient if they know ahead of time how big to make their tables.
    >
    > In this case I only need to do a presence test of the key; there is no
    > actual data associated with the key. The table is used for detecting
    > duplicate entries. Is there a more efficient way to do this test than
    > sticking an empty string into a dict? The keys are sha1.digest().


    Possibly a naive question, but would using sets be more efficient?

    They are generally used for the sort of job you are describing (testing
    for duplicates rather than storing data associated with a key).

    We did some limited performance tests at Resolver Systems and found use
    of sets and dictionaries to be almost exactly the same speed - but
    there could be memory advantages for you. Performance is also likely to
    be different for vast data sets, but I understand the hashing algorithm
    to be very similar for both...
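
    For anyone who wants to repeat that kind of measurement, a minimal
    timeit sketch (the 100000-key corpus is arbitrary; os.urandom(20)
    stands in for 20-byte sha1 digests, and numbers will vary by machine
    and Python version):

    import timeit

    setup = (
        "import os\n"
        "keys = [os.urandom(20) for _ in range(100000)]\n"
        "d = dict.fromkeys(keys, '')\n"
        "s = set(keys)\n"
        "probe = keys[12345]\n"
    )
    print(timeit.timeit("probe in d", setup=setup, number=1000000))
    print(timeit.timeit("probe in s", setup=setup, number=1000000))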

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
     
    Fuzzyman, Aug 7, 2006
    #6