I should have been more specific about possible fixes.
python2.5 -m timeit -s 'import gc' 'gc.disable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 662 msec per loop
python2.5 -m timeit -s 'import gc' 'gc.enable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 15.2 sec per loop
In the latter case, the garbage collector apparently consumes about
97% of the overall time.
In a related thread on http://bugs.python.org/issue2607
Amaury Forgeot d'Arc suggested a setting of the GC
thresholds that actually solves the problem for me:
Disabling the gc may not be a good idea in a real application; I suggest
you play with the gc.set_threshold function and set larger values, at
least while building the dictionary. (700, 1000, 10) seems to yield good
results.
python2.5 -m timeit -s 'import gc' 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 658 msec per loop
which led me to suggest using these as defaults, but then
Martin v. Löwis wrote that
No, the defaults are correct for typical applications.
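For reference, the stock thresholds Martin is defending can be inspected at runtime. A small sketch; the values in the comment are the CPython defaults as I understand them, not something measured in this thread:

```python
import gc

# The three generation thresholds of CPython's cyclic collector.
# Generation 0 is collected once (allocations - deallocations)
# exceeds the first value; the other two control how often the
# older generations are swept.
thresholds = gc.get_threshold()
print(thresholds)  # typically (700, 10, 10) on a stock CPython
```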
At that point I felt lost, and as the general wish in that thread was
to move the discussion to comp.lang.python, I brought it up here in a
modified and simplified form.
I would suggest configuring the default behaviour of the garbage
collector in such a way that this quadratic complexity is avoided
without requiring specific knowledge and intervention by the user. Not
being an expert in these details, I would like to ask the gurus how
this could be done.
I hope this should be at least technically possible, whether it is
really desirable or important for a default installation of Python
could then be discussed once the disadvantages of such a setting would
be apparent.
I still don't see what is so good about defaults that lead to O(N*N)
computation for an O(N) problem, and I like Amaury's suggestion a lot,
so I would like to see comments on its disadvantages. Please don't
tell me that O(N*N) is good enough. For N>1E7 it isn't.
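Until the defaults change, the quadratic behaviour can at least be sidestepped locally: pause the collector only around the allocation-heavy phase and restore it afterwards. A minimal sketch; the helper name gc_paused is my own invention, not a stdlib API:

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for the duration of the
    block, restoring its previous state even if an exception occurs."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Build a large structure without repeated full collections on the way.
with gc_paused():
    l = [(i,) for i in range(200000)]
```

Ordinary reference counting still frees non-cyclic garbage while the collector is paused; only the detection of reference cycles is deferred.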
About some other remarks made here:
I think the linguists of the world should write better automated
translation systems. Not being an expert in these details I would like
to ask the gurus how it could be done.
I fully agree, and that is why I (as someone involved with MT) would
prefer to focus on MT and not on memory allocation issues. By the
way, this is why I normally prefer to work in Python: it lets me
focus on the problem instead of the tool.
There are going to be pathological cases in any memory allocation
scheme. The fact that you have apparently located one calls for you to
change your approach, not for the finely-tuned, well-conceived garbage
collection scheme to be abandoned for your convenience.
I do not agree at all. Having to deal with N>1E7 objects is IMHO not
pathological, it is just daily practice in data-oriented work (we're
talking about deriving models from Gigawords, not about toy systems).
If you had made a definite proposal that could have been evaluated,
your request would perhaps have seemed a little more approachable.
I feel it is ok to describe a generic problem without having found
the answer yet.
(If you disagree: Where are your definite proposals wrt. MT ? ;-)
But I admit, I should have brought up Amaury's definite proposal
right away.
A question often asked ... of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.
I do it because I have to count frequencies of pairs of things that
appear in real data. Extremely simple stuff, only interesting because
of the size of the data set. After Amaury's hint to switch GC off
temporarily, I can count 100M pair tokens (18M pair types) in about
15 minutes, which is very cool, and I can do it in only a few lines
of code, which is even cooler.
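The counting loop really is only a few lines. A toy sketch of the pattern, assuming adjacent-pair counting over a token stream; the data and names here are illustrative, not my actual code:

```python
import gc
from collections import defaultdict

def count_pairs(tokens):
    """Count frequencies of adjacent token pairs, with the cyclic
    collector switched off while the dictionary grows."""
    counts = defaultdict(int)
    gc.disable()
    try:
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += 1
    finally:
        gc.enable()
    return counts

# Toy input; the real corpus is on the order of 100M tokens.
freqs = count_pairs(["the", "cat", "sat", "on", "the", "cat"])
print(freqs[("the", "cat")])  # -> 2
```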
Any approach that would touch the disk would be too slow for me, and
building complicated data structures myself would distract me too much
from what I need to do. That's exactly why I use Python: it has lots
of highly fine-tuned, complicated stuff behind the scenes that just
does the right thing, so I should not have to (and could not)
re-invent this myself.
Thanks for the helpful comments,
Andreas