Tremendous slowdown due to garbage collection


andreas.eisele

In an application dealing with very large text files, I need to create
dictionaries indexed by tuples of words (bi-, tri-, n-grams) or nested
dictionaries. The number of different data structures in memory grows
into orders beyond 1E7.

It turns out that the default behaviour of Python is not very suitable
for such a situation, as garbage collection occasionally traverses all
objects in memory (including all tuples) in order to find out which
could be collected. This leads to a situation where creating O(N)
objects effectively takes O(N*N) time.
Although this can be cured by switching garbage collection off before
the data structures are built and back on afterwards, it may easily
lead a user not familiar with the fine details of garbage collection
behaviour to the impression of weak scalability, which would be a
pity, as the real problem is much more specific and can be cured.

The problem is already clearly visible for 1M objects, but for larger
numbers it gets much more pronounced. Here is a very simple example
that displays the problem.
python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 329 msec per loop
python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(1000000)]'
10 loops, best of 3: 4.06 sec per loop
python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 662 msec per loop
python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 15.2 sec per loop

In the latter case, the garbage collector apparently consumes about
97% of the overall time.
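(For concreteness, the cure mentioned above - switching collection off
while the structures are built and back on afterwards - can be wrapped
in a small helper. This is only a sketch; the dictionary shown is a
placeholder for the real n-gram counts.)

import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    # Temporarily switch off the cyclic collector, restoring its
    # previous state afterwards even if an exception is raised.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

with gc_paused():
    counts = {}
    # ... build the large dictionary of n-gram tuples here ...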

I would suggest configuring the default behaviour of the garbage
collector in such a way that this quadratic complexity is avoided
without requiring specific knowledge and intervention by the user. Not
being an expert in these details, I would like to ask the gurus how
this could be done.
I hope this is at least technically possible; whether it is really
desirable or important for a default installation of Python could then
be discussed once the disadvantages of such a setting become apparent.

Thanks a lot for your consideration, and best regards,
Andreas

------
Dr. Andreas Eisele, Senior Researcher
DFKI GmbH, Language Technology Lab, (e-mail address removed)
Saarland University Computational Linguistics
Stuhlsatzenhausweg 3 tel: +49-681-302-5285
D-66123 Saarbrücken fax: +49-681-302-5338
 

Steve Holden

[...]
I would suggest configuring the default behaviour of the garbage
collector in such a way that this quadratic complexity is avoided
without requiring specific knowledge and intervention by the user. Not
being an expert in these details, I would like to ask the gurus how
this could be done.
I hope this is at least technically possible; whether it is really
desirable or important for a default installation of Python could then
be discussed once the disadvantages of such a setting become apparent.

Thanks a lot for your consideration, and best regards,
Andreas
I think the linguists of the world should write better automated
translation systems. Not being an expert in these details I would like
to ask the gurus how it could be done.

There are going to be pathological cases in any memory allocation
scheme. The fact that you have apparently located one calls for you to
change your approach, not for the finely-tuned well-conceived garbage
collection scheme to be abandoned for your convenience.

If you had made a definite proposal that could have been evaluated,
your request would perhaps have seemed a little more approachable.

You might want to consider trying Jython, or IronPython, or PyPy, each
of which comes with a different flavor of garbage collection, to see if
one of the other approaches suits you better.

regards
Steve
 

Carl Banks

I would suggest configuring the default behaviour of the garbage
collector in such a way that this quadratic complexity is avoided
without requiring specific knowledge and intervention by the user. Not
being an expert in these details, I would like to ask the gurus how
this could be done.

Well, the garbage collector activates whenever allocations exceed
deallocations by a certain amount, which, for obvious reasons, is the
source of the bottleneck when you keep allocating objects that are
never deallocated.

My first stab at a suggestion would be to adjust the threshold
depending on how successful it is. So if a garbage collection run
collects no objects, double (or whatever) the threshold until the next
run.

More formulaically, if your threshold is X, and a garbage collection
pass collects Y objects, then the threshold for the next pass would be
something like 2*X/(Y/X+1). Then you can create your very large data
structure in only O(N log N) time.
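(A rough prototype of that adaptive policy, for illustration only: it
uses the gc.callbacks hook that exists in Python 3.3+, not in the 2.5
discussed here, and it arbitrarily applies the formula to the gen-0
threshold after each full collection.)

import gc

def adapt_threshold(phase, info):
    # Re-tune the threshold after each full (generation 2) pass.
    if phase != "stop" or info["generation"] != 2:
        return
    x = gc.get_threshold()[0]      # current gen-0 threshold
    y = info["collected"]          # objects actually freed by this pass
    # 2*X/(Y/X+1): an unproductive pass doubles the threshold,
    # a fully productive one leaves it roughly unchanged.
    new_x = int(2 * x / (y / float(x) + 1)) if x else 700
    gc.set_threshold(max(new_x, 700))

gc.callbacks.append(adapt_threshold)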


But as Steve Holden said, it'll probably mess someone else up.
There's not much you can do with a garbage collector to make it please
everyone.


A question often asked--and I am not a big fan of these sorts of
questions, but it is worth thinking about--of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.


Carl Banks
 

andreas.eisele

I should have been more specific about possible fixes.
python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'

10 loops, best of 3: 662 msec per loop
python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'

10 loops, best of 3: 15.2 sec per loop

In the latter case, the garbage collector apparently consumes about
97% of the overall time.

In a related thread on http://bugs.python.org/issue2607
Amaury Forgeot d'Arc suggested a setting of the GC
thresholds that actually solves the problem for me:
Disabling the gc may not be a good idea in a real application; I suggest
you to play with the gc.set_threshold function and set larger values, at
least while building the dictionary. (700, 1000, 10) seems to yield good
results.
python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 658 msec per loop

which made me suggest using these as defaults, but then
Martin v. Löwis wrote that
No, the defaults are correct for typical applications.

At that point I felt lost, and as the general wish in that thread was
to move the discussion to comp.lang.python, I brought it up here in a
modified and simplified form.
I would suggest configuring the default behaviour of the garbage
collector in such a way that this quadratic complexity is avoided
without requiring specific knowledge and intervention by the user. Not
being an expert in these details, I would like to ask the gurus how
this could be done.
I hope this is at least technically possible; whether it is really
desirable or important for a default installation of Python could then
be discussed once the disadvantages of such a setting become apparent.

I still don't see what is so good about defaults that lead to O(N*N)
computation for an O(N) problem, and I like Amaury's suggestion a lot,
so I would like to see comments on its disadvantages. Please don't
tell me that O(N*N) is good enough. For N>1E7 it isn't.

About some other remarks made here:
I think the linguists of the world should write better automated
translation systems. Not being an expert in these details I would like
to ask the gurus how it could be done.

I fully agree, and that is why I (as someone involved with MT) would
prefer to focus on MT and not on memory allocation issues. By the way,
this is also why I normally prefer to work in Python: it lets me focus
on the problem instead of the tool.
There are going to be pathological cases in any memory allocation
scheme. The fact that you have apparently located one calls for you to
change your approach, not for the finely-tuned well-conceived garbage
collection scheme to be abandoned for your convenience.

I do not agree at all. Having to deal with N>1E7 objects is IMHO not
pathological, it is just daily practice in data-oriented work (we're
talking about deriving models from Gigawords, not about toy systems).
If you had made a definite proposal that could have been evaluated,
your request would perhaps have seemed a little more approachable.

I feel it is ok to describe a generic problem without having found
the answer yet.
(If you disagree: Where are your definite proposals wrt. MT ? ;-)
But I admit, I should have brought up Amaury's definite proposal
right away.
A question often asked ... of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.

I do it because I have to count frequencies of pairs of things that
appear in real data. Extremely simple stuff, only interesting because
of the size of the data set. After Amaury's hint to switch GC off
temporarily, I can count 100M pair tokens (18M pair types) in about 15
minutes, which is very cool, and I can do it in only a few lines of
code, which is even cooler.
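(For illustration, a minimal sketch of that kind of counting; the
corpus path and whitespace tokenisation are placeholders, not taken
from the thread.)

import gc
from collections import defaultdict

def count_pairs(tokens):
    # Tally adjacent-pair (bigram) frequencies from a token stream.
    counts = defaultdict(int)
    prev = None
    for tok in tokens:
        if prev is not None:
            counts[(prev, tok)] += 1
        prev = tok
    return counts

gc.disable()   # the workaround discussed in this thread
pair_counts = count_pairs(
    tok for line in open("corpus.txt") for tok in line.split())
gc.enable()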
Any approach that would touch the disk would be too slow for me, and
bringing in complicated data structures myself would distract me too
much from what I need to do. That's exactly why I use Python: it has
lots of highly fine-tuned, complicated stuff behind the scenes that is
just doing the right thing, so I should not have to (and could not)
re-invent this myself.

Thanks for the helpful comments,
Andreas
 

John Nagle

In an application dealing with very large text files, I need to create
dictionaries indexed by tuples of words (bi-, tri-, n-grams) or nested
dictionaries. The number of different data structures in memory grows
into orders beyond 1E7.

It turns out that the default behaviour of Python is not very suitable
for such a situation, as garbage collection occasionally traverses all
objects in memory (including all tuples) in order to find out which
could be collected. This leads to a situation where creating O(N)
objects effectively takes O(N*N) time.

Do your data structures need garbage collection? CPython is
a reference counted system with garbage collection as a backup
to catch cycles. Try using weak back-pointers in your data
structures to avoid creating cycles.
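(A minimal sketch of that idea: keeping the back-pointer weak means a
parent/child pair never forms a reference cycle, so reference counting
alone reclaims it. The class is illustrative, not from the thread.)

import weakref

class Node(object):
    def __init__(self, value, parent=None):
        self.value = value
        self.children = []
        # A weak reference avoids the strong parent <-> child cycle.
        self._parent = weakref.ref(parent) if parent is not None else None

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None

root = Node("root")
child = Node("leaf", parent=root)
root.children.append(child)
assert child.parent is root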

John Nagle
 

Christian Heimes

which made me suggest using these as defaults, but then
Martin v. Löwis wrote that


At that point I felt lost, and as the general wish in that thread was
to move the discussion to comp.lang.python, I brought it up here in a
modified and simplified form.

Martin said that the default settings for the cyclic gc work for most
people. Your test case has found a pathological corner case which is *not*
typical for common applications but typical for an artificial benchmark.
Python is optimized for regular apps, not for benchmarks (like some video
drivers).

By the way, you shouldn't use range() for large ranges of more than a
thousand items. xrange() should be faster and it will definitely use
much less memory - memory that Python 2.5 and older will never release
again. I'm going to fix the issue for Python 2.6 and 3.0.

Christian
 

andreas.eisele

Sorry, I have to correct my last posting again:
Disabling the gc may not be a good idea in a real application; I suggest
you to play with the gc.set_threshold function and set larger values, at
least while building the dictionary. (700, 1000, 10) seems to yield good
results.
python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'

10 loops, best of 3: 658 msec per loop

Looks nice, but it is pointless, as timeit disables gc by default,
and the set_threshold hence has no effect. In order to see the real
effect of this, I need to say
python2.5 -m timeit 'gc.enable();gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'

which gives me

10 loops, best of 3: 1.26 sec per loop

Still a factor of more than 10 faster than the default settings.
which made me suggest using these as defaults, but then Martin v.
Löwis wrote that the defaults are correct for typical applications.

Complex topic...
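(In application code, as opposed to a timeit one-liner, Amaury's
adjustment might be applied roughly like this; the (700, 1000, 10)
values are the ones quoted above, and whether they suit other
workloads is an open question.)

import gc

old_thresholds = gc.get_threshold()
gc.set_threshold(700, 1000, 10)   # collect the older generations far less often
try:
    counts = {}
    # ... build the large dictionary here ...
finally:
    gc.set_threshold(*old_thresholds)   # restore the defaults afterwards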
 

andreas.eisele

Martin said that the default settings for the cyclic gc work for most
people.

I agree.
Your test case has found a pathological corner case which is *not*
typical for common applications but typical for an artificial benchmark.

I agree that my "corner" is not typical, but I strongly disagree with
the classification as pathological. The only feature of my test case
that is not typical is the huge number of distinct objects that are
allocated. I admit that 1E7 objects is today still fairly untypical,
but there is nothing pathological about it; it is just bigger. It is
about as pathological as a file size >2G, which a few years ago seemed
so outrageous that no OS bothered to support it, but is fairly common
nowadays, so that a lack of support would now appear as an arbitrary
and unmotivated limitation. We all enjoy seeing Python adopted on
a large scale and used by a broad community, so we should not accept
arbitrary size limits. You could call a string with more than 2GB
pathological, but I very much appreciate the fact that Python supports
such strings for the few cases where they are needed (on a 64-bit
architecture). Now an O(N*N) effort for large numbers of objects isn't
such a hard limit, but in practice it boils down to the same effect:
people cannot use Python in such circumstances. I would prefer it very
much if such "soft limits" could be avoided as well.

Given there is a fairly simple workaround (thanks again to Amaury!),
the issue is not urgent, but I still think it is important in the long
run.
Python is optimized for regular apps, not for benchmarks (like some video
drivers).

I still think it would be worthwhile to support very large numbers of
objects in a way that they can just be used, without knowledge of
special tricks, and I would be fairly optimistic that those who have
designed the current GC schemes could generalize them slightly so that
these marginal cases will work better without imposing a penalty on
the more typical cases.
By the way, you shouldn't use range() for large ranges of more than a
thousand items. xrange() should be faster and it will definitely use
much less memory - memory that Python 2.5 and older will never release
again. I'm going to fix the issue for Python 2.6 and 3.0.

Thanks for this hint, and for the work on the newer versions. This is
very much appreciated.

Andreas
 

Steve Holden

I agree.


I agree that my "corner" is not typical, but I strongly disagree with
the classification as pathological. The only feature of my test case
that is not typical is the huge number of distinct objects that are
allocated. I admit that 1E7 objects is today still fairly untypical,
but there is nothing pathological about it; it is just bigger. It is
about as pathological as a file size >2G, which a few years ago seemed
so outrageous that no OS bothered to support it, but is fairly common
nowadays, so that a lack of support would now appear as an arbitrary
and unmotivated limitation. We all enjoy seeing Python adopted on
a large scale and used by a broad community, so we should not accept
arbitrary size limits. You could call a string with more than 2GB
pathological, but I very much appreciate the fact that Python supports
such strings for the few cases where they are needed (on a 64-bit
architecture). Now an O(N*N) effort for large numbers of objects isn't
such a hard limit, but in practice it boils down to the same effect:
people cannot use Python in such circumstances. I would prefer it very
much if such "soft limits" could be avoided as well.

Given there is a fairly simple workaround (thanks again to Amaury!),
the issue is not urgent, but I still think it is important in the long
run.


I still think it would be worthwhile to support very large numbers of
objects in a way that they can just be used, without knowledge of
special tricks, and I would be fairly optimistic that those who have
designed the current GC schemes could generalize them slightly so that
these marginal cases will work better without imposing a penalty on
the more typical cases.
I believe you are making surmises outside your range of competence
there. While your faith in the developers is touching, the garbage
collection scheme is something that has received a lot of attention with
respect to performance under typical workloads over the years.

By the way, the term "pathological" (which I believe I also used about
your application) doesn't imply anything bad about your program: it's
merely a shorthand way of saying that it triggers this undesirable
behavior in a garbage collector that is mostly satisfactory for other
purposes.
Thanks for this hint, and for the work on the newer versions. This is
very much appreciated.
Anyway, I'm glad Amaury's workaround has solved your problem at least
temporarily. Your type of workload may become more typical over time,
but at the moment you are probably somewhat ahead of the mainstream
here. If you know there can't be any recoverable cycles then you are
probably better off just telling the garbage collector that you know
better than it does.

regards
Steve
 

Paul Rubin

Steve Holden said:
I believe you are making surmises outside your range of competence
there. While your faith in the developers is touching, the garbage
collection scheme is something that has received a lot of attention
with respect to performance under typical workloads over the years.

Really, Python's cyclic gc is quite crude and should be upgraded to
something that doesn't fall into that quadratic behavior. There was
some fairly detailed discussion of this in another thread last time
the subject came up.
 

Steve Holden

Paul said:
Really, Python's cyclic gc is quite crude and should be upgraded to
something that doesn't fall into that quadratic behavior. There was
some fairly detailed discussion of this in another thread last time
the subject came up.

I'll take your word for it.

regards
Steve
 

Martin v. Löwis

I still don't see what is so good about defaults that lead to O(N*N)
computation for an O(N) problem, and I like Amaury's suggestion a lot,
so I would like to see comments on its disadvantages. Please don't
tell me that O(N*N) is good enough. For N>1E7 it isn't.

Please understand that changing the defaults will *not* affect the
asymptotic complexity. If your application shows O(N*N) behaviour,
then multiplying each of the gc thresholds by 1000 might make it run
faster, but it will *still* run in O(N*N).

Regards,
Martin
 

Rhamphoryncus

I'll take your word for it.

I discussed it not too long ago, but I can't seem to find the thread..

Basically, python's gen2 collections are to blame. You get a full
(gen2) collection a linear number of times for a linear number of
allocations, but the cost of each collection is also linear, giving
you O(n**2) cost. I think it's fairly clear that it should not be
that expensive.
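(A quick way to watch those full passes fire while objects are being
allocated; purely diagnostic, and the list comprehension is just the
benchmark from earlier in the thread.)

import gc

gc.set_debug(gc.DEBUG_STATS)        # print a line per collection pass
l = [(i,) for i in range(1000000)]
gc.set_debug(0)
print(gc.get_count(), gc.get_threshold())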
 

Aaron Watters

A question often asked--and I am not a big fan of these sorts of
questions, but it is worth thinking about--of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.

Well, probably because you can get more than 100x better performance
if you don't involve the disk and use clever in-memory indexing.
BTW, I think the default behaviour of the gc is pretty silly. I tend
to disable automatic gc and explicitly put in collections when I know
I'm done with some big operation these days.
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=dumb+slow
 

Carl Banks

Well, probably because you can get more than 100x better performance
if you don't involve the disk and use clever in-memory indexing.

Are you sure it won't involve disk use? I'm just throwing this out
there, but if you're creating a structure of hundreds of megabytes in
memory, there's a chance the OS will swap it out to disk, which defeats
any improvements in latency you would have gotten.

However, that is for the OP to decide. The reason I don't like the
sort of question I posed is it's presumptuous--maybe the OP already
considered and rejected this, and has taken steps to ensure the in
memory data structure won't be swapped--but a database solution should
at least be considered here.


Carl Banks
 

Aaron Watters

However, that is for the OP to decide. The reason I don't like the
sort of question I posed is it's presumptuous--maybe the OP already
considered and rejected this, and has taken steps to ensure the in
memory data structure won't be swapped--but a database solution should
at least be considered here.

Yes, you are right, especially if the index structure will be needed
many times over a long period of time. Even here, though, these days
you can go pretty far by loading everything into core (streaming from
disk) and dumping everything out when you are done, if needed
(ahem, using the preferred way to do this from Python for speed and
safety: marshal ;) ).
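(A sketch of that dump-and-reload pattern: marshal handles plain
dicts/tuples/ints quickly, though its file format is CPython-version
specific. The file name and toy dict are placeholders.)

import marshal

counts = {("a", "b"): 3, ("b", "c"): 1}    # the in-memory index

with open("counts.marshal", "wb") as f:    # dump when done
    marshal.dump(counts, f)

with open("counts.marshal", "rb") as f:    # load it back in later
    counts = marshal.load(f)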

Even with B-trees, if you jump around in the tree the performance can
be awful. This is why Nucular, for example, is designed to stream
results sequentially from disk whenever possible. The one place where
it doesn't do this very well (proximity searches) shows the most
performance problems under bad circumstances, like searching for two
common words in proximity.
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=joys
 

Paul Rubin

Aaron Watters said:
Even with B-trees, if you jump around in the tree the performance can
be awful.

The Linux file cache really helps. The simplest approach is to just
"cat" the index files to /dev/null a few times an hour. Slightly
faster (what I do with Solr) is to mmap the files into memory and read
a byte from each page now and then. Assuming (as in Lucene) that the
index file format is compressed, this approach is far more
RAM-efficient than actually unpacking the index into data structures,
though of course you take the overhead (a few microseconds) of a
couple of system calls at each access to the index, even when it's all
in cache.
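(A hedged sketch of that cache-warming trick: mmap the index file and
touch one byte per page so the OS keeps it resident. The path and page
size are placeholders.)

import mmap
import os

def warm_cache(path, page_size=4096):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for offset in range(0, size, page_size):
                m[offset]          # touching one byte faults the page in
        finally:
            m.close()

warm_cache("/var/index/segment.dat")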
 

Dieter Maurer

Christian Heimes said:
Martin said that the default settings for the cyclic gc work for most
people. Your test case has found a pathological corner case which is *not*
typical for common applications but typical for an artificial benchmark.
Python is optimized for regular apps, not for benchmarks (like some video
drivers).

Martin said it but nevertheless it might not be true.

We observed similar, very bad behaviour -- in a Web application server.
Apparently, the standard behaviour is far from optimal when the
system contains a large number of objects and, occasionally, large
numbers of objects are created in a short time.
We have seen such behaviour during parsing of larger XML documents, for
example (in our Web application).


Dieter
 

Martin v. Löwis

Martin said it but nevertheless it might not be true.
We observed similar, very bad behaviour -- in a Web application server.
Apparently, the standard behaviour is far from optimal when the
system contains a large number of objects and, occasionally, large
numbers of objects are created in a short time.
We have seen such behaviour during parsing of larger XML documents, for
example (in our Web application).

I don't want to claim that the *algorithm* works well for all typical
applications. I just claim that its *parameters* are fine.
The OP originally proposed to change the parameters, making garbage
collection run less frequently. This would a) have bad consequences
in terms of memory consumption on programs that do have allocation
spikes, and b) have no effect on the asymptotic complexity of the
algorithm in the case discussed.

It may well be that other algorithms would perform better, but
none have been contributed.

Regards,
Martin
 
