....
[Jon Smirl]
I know in advance how many items will be added to the dictionary. Most
dictionary implementations I have previously worked with are more
efficient if they know ahead of time how big to make their tables.
Richard Jones spent considerable time investigating whether
"pre-sizing" lists and dicts in CPython could help, at the "Need For
Speed" sprint earlier this year. He didn't find a win worth getting;
e.g., read the section "List pre-allocation" at:
http://wiki.python.org/moin/NeedForSpeed/Failures
Trying it for dicts was also part of what he did, but I don't think
specifics about that were recorded on the Wiki. I was at the sprint,
and hoots of triumph from Richard's direction were conspicuous by
their absence during his dict time ;-)
In this case I only need to do a presence test of the key, there is no
actual data associated with the key. The table is used for detecting
duplicate entries. Is there a more efficient way to do this test than
sticking an empty string into a dict? The keys are sha1.digest().
It's probably more common to set the value to None or 1, but it
doesn't really matter in reality. In theory, using 1 or an empty
string relies on the implementation accident that CPython stores those
uniquely, while it's guaranteed that None is a singleton object.
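A minimal sketch of that idiom (the `is_duplicate` helper and its name are my own invention, not from the original post), using None as the don't-care value:

```python
import hashlib

seen = {}  # maps digest -> None; only key presence matters

def is_duplicate(data):
    """Return True if this chunk's SHA-1 digest has been seen before."""
    key = hashlib.sha1(data).digest()
    if key in seen:
        return True
    seen[key] = None  # None is a guaranteed singleton, so no per-entry value cost
    return False

print(is_duplicate(b"hello"))  # False: first sighting
print(is_duplicate(b"hello"))  # True: duplicate
```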
BTW, is there a reason to use SHA instead of MD5? I ask because the
latter is 4 bytes shorter, and you apparently have a /lot/ of keys.
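The size difference is easy to check with hashlib:

```python
import hashlib

# MD5 digests are 16 bytes; SHA-1 digests are 20 -- 4 bytes saved per key.
md5_len = len(hashlib.md5(b"example").digest())
sha1_len = len(hashlib.sha1(b"example").digest())
print(md5_len, sha1_len)  # 16 20
```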
If you're using a current version of Python, you can save some memory
by using a builtin set object instead. The CPython implementation of
sets is very much like its implementation of dicts, but doesn't
consume any memory for the (non-existent) associated values. You also
get a number of operations useful on sets (like intersection and
union).
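The same duplicate check with a set looks like this (a sketch; the `add_if_new` name is mine, not from the thread):

```python
import hashlib

seen = set()  # stores only the keys; no slot for associated values

def add_if_new(data):
    """Record this chunk's SHA-1 digest; return True if it was new."""
    key = hashlib.sha1(data).digest()
    if key in seen:
        return False
    seen.add(key)
    return True
```

Membership tests (`key in seen`) cost the same as a dict lookup, but each entry saves the value pointer, and you get set algebra (`&`, `|`, `-`) for free.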
...
Since I am rerunning the conversion over and over, anything simple that
speeds it up is helpful.
I already have 4GB RAM; it would take around 30GB to get everything in memory.
Dictionaries are not a big problem for me, but there are many in use and
they have millions of items in them.
A peculiar suggestion: if you don't need to release system resources
cleanly at the end of a run, try doing:
import os
os._exit(0)
at the end. If you have dicts with millions of entries swapped to
disk, it /can/ consume major time just to page them all in again to
decrement the key & value refcounts if you let "clean shutdown" code
determine they've become trash. Bailing ungracefully can skip all
that work (but also skips other work, like letting the platform C I/O
library close still-open files gracefully).
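A sketch of how that bail-out might be wrapped (the `finish_fast` helper is my own framing; flushing first is one way to soften the "skips other work" caveat for Python-level buffers):

```python
import os
import sys

def finish_fast():
    # os._exit() terminates the process immediately: no atexit handlers,
    # no finalizers, and crucially no refcount-decrementing sweep over
    # millions of swapped-out dict entries.
    sys.stdout.flush()  # flush what we can ourselves, since cleanup is skipped
    sys.stderr.flush()
    os._exit(0)
```

Any still-open files written through buffered C stdio may lose their last buffer, so this only suits runs whose useful output is already safely on disk.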