garbage collection / cyclic references

A

Aaron Brady

Hello,

I was reading and Googling about garbage collection, reference
counting, and the problem of cyclic references.

Python's garbage collection module claims to be able to detect and
break cyclic garbage. Some other languages merely prohibit it. Is
this the place to ask about its technique?

I understand that the disadvantage is a non-deterministic order of
deletion/finalization. It's an acceptable cost for the system I am
considering. I may even impose an arbitrary but consistent order on
it, such as by class name, by id, etc. (tm!).

After all the hullabaloo, I don't see that Python's purportedly
successful strategy amounts to anything more than a breadth-first
search, plus some optimizations. Did I miss something? What are its
caveats and fragilities? If free software can do it, why isn't it all
over the industry? What disqualifies it from solved-problem status?

Finally, what are the costs and benefits of an additional
'__gc_clear__' special method, for example, that is the equivalent of
the 'tp_clear' method on extension types? Or, perhaps, a
'__gc_cycle__' method, that is called with the attribute names of
attributes that lead to a cycle, that would be called prior to the
'__del__' method?
 
P

Paul Rubin

andrew cooke said:
the two dominant virtual machines - .net and the jvm both handle circular
references with no problem whatever.

AFAIK, they also don't guarantee that finalizers ever run, much less
run in deterministic order.
 
A

Aaron Brady

Aaron Brady wrote:

[...]
caveats and fragilities?  If free software can do it, why isn't it all
over the industry?  What disqualifies it from solved-problem status?

the two dominant virtual machines - .net and the jvm both handle circular
references with no problem whatever.  this is standard in modern garbage
collection - go read a book on the subject (personally i like grune et
al's modern compiler design).  it *is* a solved problem.  if anything,
python is behind the curve, not ahead of it, but this may change with the
next generation of python implementations (pypy targets a variety of vms,
i think).

as for the extra methods you suggest - why do you want to expose
implementation details in an api?  that is not the normal aim of good
design.

andrew

"Circular references ...can only be cleaned up if there are no Python-
level __del__() methods involved." __del__ doc.

"Python doesn’t collect ... cycles automatically because, in general,
it isn’t possible for Python to guess a safe order in which to run the
__del__() methods." gc.garbage doc.

"Errors should never pass silently." -The Zen of Python

I advance that cyclic objects should be notified when their external
references go to zero, but their usual '__del__' is inappropriate. If
objects implement a __del__ method, they can choose to implement a
'__gc_cycle__' method, and then just drop the specified attribute. It
needn't be called on every object in the cycle, either; once it's
called on one object, another object's normal __del__ may be safely
called. Output, unproduced:
In X.__gc_cycle__, 'other' attribute. Deleting...
In Y.__del__.
In X.__del__.
The actual backend of CPython requires garbage-collected container
types to implement tp_inquiry and tp_clear methods, but user-defined
types apparently aren't required to conform.

Supporting Cyclic Garbage Collection
http://docs.python.org/3.0/c-api/gcsupport.html
 
M

Martin v. Löwis

The actual backend of CPython requires garbage-collected container
types to implement tp_inquiry and tp_clear methods, but user-defined
types apparently aren't required to conform.

tp_inquiry doesn't exist, you probably mean tp_traverse. tp_traverse
is completely irrelevant for python-defined types; the VM can traverse
a user-defined type just fine even without the help of tp_traverse.
If a C-defined type fails to implement tp_traverse when it should,
then garbage collection breaks entirely.

tp_clear isn't invoked for an object at all if the object is in a
cycle with finalizers, so it's not something that you can use to
detect that you are in a cycle with finalizers.

Cycles with finalizers are considered a bug; application programmers
should check gc.garbage at the end of the program to determine whether
they have this bug. There is an easy design pattern around it, so I'm
-1 on complicating the GC protocol.

Regards,
Martin
 
A

Aaron Brady

i think you're right, but i'm missing your point - perhaps there was some
sub-context to the original post that i didn't understand?

finalizers should not be considered part of a public resource management
api - they should not be used to do things like flushing and closing
files, for example.   i think this was a minor "issue" early in java's
adoption (i guess because of incorrect assumptions made by c++
programmers) (in python the with context is a much better mechanism for
this kind of thing - the best java has is the finally statement).  but
it's one of those things that (afaik) isn't an issue once you fully
embrace the language (rather like, say, semantically meaningful
indentation).

but i'm sure you know all that, so i'm still wondering what i've missed.

andrew

Theoretically, my object should be able to maintain an open resource
for its lifetime; and its clients shouldn't need to know what its
lifetime is. Therefore, it needs a callback when that is over.

If finalization methods could be called in a structurally sound
manner, they could be relied on to handle flushing and closing files.
they should not be used to do things like flushing and closing
files, for example.

What is your basis for this claim, if it's not the mere unreliability
of finalization? IOW, are you not merely begging the question?
 
A

Aaron Brady

Aaron said:
What is your basis for this claim, if it's not the mere unreliability
of finalization?  IOW, are you not merely begging the question?

I'm not sure it's clear, but I was talking about Java.

As Paul implied, a consequence of completely automated garbage management
is that it is (from a programmer's POV) deterministic.  So it's a [indeterministic]
programming error to rely on the finalizer to free resources that don't
follow that model (ie any resource that's anything other that [than]
reasonable
amounts of memory).

That's pretty much an unavoidable consequence of fully automated garbage
collection.  You can pretend it's not, and try using finalizers for other
work if you want.  That's fine - it's your code, not mine.  I'm just
explaining how life is.

Andrew

Hi, nice to talk to you this early. Sorry you're in a bad mood.
You've sure come to the right place to find friends though. </tongue
in cheek>

My point is, that garbage collection is able to detect when there are
no program-reachable references to an object. Why not notify the
programmer (the programmer's objects) when that happens? If the
object does still have other unreachable references, s/he should be
informed of that too.

I advanced an additional method to this end. Do you argue that there
aren't any cases in which the class could make use of the information;
or there aren't /enough/ cases so in which?

Perhaps it would help to handle a contrary case by hand. Two objects
need to make write operations each to the other when they are closed.
Would it be sufficient in general, knowing nothing further about them,
to queue some information, and close? Do they always know at design-
time their references will be cyclic? Would a mere
'__leave_reachability__' method be more generally informative or
robust? Would it constitute a two-step destruction, to notify objects
when they're unreachable, and then finalize? The two objects' write
operations could execute in such a method, without risking prior
destruction.
 
A

Aaron Brady

I'm not sure it's clear, but I was talking about Java.
As Paul implied, a consequence of completely automated garbage management
is that it is (from a programmer's POV) deterministic.  So it's a [indeterministic]
programming error to rely on the finalizer to free resources that don't
follow that model (ie any resource that's anything other that [than]
reasonable
amounts of memory).
That's pretty much an unavoidable consequence of fully automated garbage
collection.  You can pretend it's not, and try using finalizers for other
work if you want.  That's fine - it's your code, not mine.  I'm just
explaining how life is.

My point is, that garbage collection is able to detect when there are
no program-reachable references to an object.  Why not notify the
programmer (the programmer's objects) when that happens?  If the
object does still have other unreachable references, s/he should be
informed of that too.
snip

I took the liberty of composing a sample cyclic reference detector. I
will post the class definition later on in the discussion (when and
if).

The 'run' method resets the globals to a sample graph, as
illustrated. 'p' and 's' start out with one simulated program-visible
reference each. As you see, the details are already long and boring
(yum). I added comments post-facto.
{<s,0>: 0, <p,2>: 0, <r,1>: 0, <q,2>: 0}
<s,0> ALL zero #'s' safe to delete
{<p,1>: 0, <q,2>: 0, <r,1>: 0}
{<s,0>: 0, <p,3>: 1, <r,1>: 0, <q,2>: 0}
{<s,0>: 0, <p,3>: 1, <q,2>: 0, <r,1>: 0}
{<p,2>: 1, <q,2>: 0, <r,1>: 0}
finalizing <s,0> #'p' not safe, reference still visible

We notice the duplicate 'all zero' indicator on run #2. The cycle
detector ran on 's.decref', then 's' called 'p.decref', then the cycle
detector ran on that. 'q' and 'r' are safe to delete on runs 2 and 3.

Here is the implementation of 'final':

def final( self ):
for _, v in self.__dict__.items( ):
if not isinstance( v, G ):
continue
v.decref( )
print( 'finalizing', self )

The object should be asked to finish its references (cyclic only?),
but remain alive. The programmer should see that the state is
consistent. Later, its __del__ will be called.

We can decide that '__leave_reachability__' will be called without
nesting; and/or that '__del__' will be called without nesting, by
breaking finalization in to two steps.

FTR, this makes __leave_reachability__ about the equivalent of
tp_clear, since tp_traverse is prior defined for user-defined types.
 
J

John Nagle

Aaron said:
Hello,

I was reading and Googling about garbage collection, reference
counting, and the problem of cyclic references.

Python's garbage collection module claims to be able to detect and
break cyclic garbage. Some other languages merely prohibit it. Is
this the place to ask about its technique?

I understand that the disadvantage is a non-deterministic order of
deletion/finalization.

Garbage collection and destructors or "finalizers" don't play well
together. It's a fundamental problem. Calling finalizers from the
garbage collector is painful. It introduces concurrency where the
user may not have expected it. Consider what happens if a finalizer
tries to lock something. What if GC runs while that lock is locked?
This can create a deadlock situation. Calling finalizers from the
garbage collector can result in intermittent, very hard to find bugs.

C++ takes destructors seriously; objects are supposed to be destructed
exactly once, and if they're of "auto" scope (a local object in the
Python sense) they will reliably be cleaned up at block exit.
Microsoft's "Managed C++" broke those rules; in Managed C++,
destructors can be called more than once. (Look up "re-animation"
in Microsoft Managed C++ literature. It's not pretty.)

Python actually has reference counting backed up by garbage collection.
Most objects are destroyed as soon as all references to them disappear.
Garbage collection is only needed to deal with cycles.

Python has "weak references", which won't keep an object around
once all the regular references are deleted. These are useful in
some situations. In a tree, for example, pointers towards the leaves
should be strong pointers, while back-pointers towards the root should
be weak pointers.

I once modified BeautifulSoup, the HTML parser, to use weak pointers
that way. BeautifulSoup trees are big and don't go away immediately when
no longer needed, because they have backpointers. They hang around until
the next GC cycle. With the version that uses weak pointers, they go away
as soon as they're no longer needed. We've found this useful in a web
crawler; the data space used drops and actual GC runs are no longer
necessary.

Personally, I'd argue that the right answer is to prohibit cycles of
strong pointers. That should be considered a programming error, and
detected at run time, at least by debugging tools. With weak pointers,
you don't really need cycles of strong pointers.

The advantage of this is a clean order of destruction. This is useful
in window widget systems, where you have objects with pointers going in many
directions, yet object destruction has substantial side effects.

Python originally had only reference counting, and didn't have weak pointers.
If weak pointers had gone in before the garbage collector, Python might have
gone in this direction.

John Nagle
 
A

Aaron Brady

     Garbage collection and destructors or "finalizers" don't play well
together.  It's a fundamental problem.  Calling finalizers from the
garbage collector is painful.  It introduces concurrency where the
user may not have expected it.  Consider what happens if a finalizer
tries to lock something.  What if GC runs while that lock is locked?
This can create a deadlock situation.  Calling finalizers from the
garbage collector can result in intermittent, very hard to find bugs.

As I understand it, 'obj.decref()' can call 'other.decref()', which
can try to access its reference to 'obj', which has already begun
cleanup. At that point, 'obj' is in an inconsistent state. Its own
methods are secretly called during its '__del__'.

One step would be to serialize this process, so that 'other.decref()'
gets deferred until 'obj.decref()' has completed. Then attributes are
in an all-or-nothing state, which is at least consistent. However,
that means every external reference in a '__del__' method has to be
wrapped in an exception handler, one at a time, because the object /
might/ already be in a reference cycle. (Non-container types are
excepted.) The remaining reference merely needs to change its class
to a ReclaimedObject type. That's acceptable if documented. I also
believe it solves the potential for deadlock.
(Look up "re-animation"
in Microsoft Managed C++ literature.  It's not pretty.)
Pass!

     Python actually has reference counting backed up by garbage collection.
Most objects are destroyed as soon as all references to them disappear.
Garbage collection is only needed to deal with cycles.

     Python has "weak references", which won't keep an object around
once all the regular references are deleted.  These are useful in
some situations.  In a tree, for example, pointers towards the leaves
should be strong pointers, while back-pointers towards the root should
be weak pointers. snip

    Personally, I'd argue that the right answer is to prohibit cycles of
strong pointers.  That should be considered a programming error, and
detected at run time, at least by debugging tools.  With weak pointers,
you don't really need cycles of strong pointers.

Reference cycles can be detected anyway with debug tools, even prior
to destruction. My objection is that would complicate control flow
severely:

for x in y:
z.append( x )

becomes:

for x in y:
if cyclic_ref( x ):
z.append( weakref.ref( x ) )
else:
z.append( x )

And worse, every attribute access has to be wrapped.

for x in z:
if isinstance( x, __builtins__.weakref ):
if x() is not None:
print( x() )
else:
print( x )

In other words, it interferes with uniform access to attributes and
container members. However, in the case where you know a structure a
priori, it's a good technique, as your example showed. I observe that
my proposal has the same weakness!

If you make the case that you usually do know the structure your data
have, I won't be able to disprove it. The example would come from a
peer-to-peer representation of something, or storage of relational
data.

Regardless, the group has responded to most of my original post. I
don't think I emphasized however that I'm designing an allocation
system that can contain reference cycles; and I was asking if such
special methods, '__gc_cycle__( self, attr )' or '__gc_clear__
( self )' would be right for me. I'm also interested in feedback
about the serialization method of ref. counting earlier in this post.
 
A

Aaron Brady

i think we're mixing python-specific and more general / java details, but,
as far as i understand things, state of the art (and particularly
generational) garbage collectors don't guarantee that objects will ever be
reclaimed.  this is a trade for efficiency, and it's a trade that seems to
be worthwhile and popular.

It's at best worthless, but so is politics. I take it back; you can
reclaim memory in large numbers with a probabilistic finalizer. The
expected value of reclaiming a KB with a 90% chance of call is .9 KB.

The allocation structure I am writing will have a very long up-time.
I can forcibly reclaim the memory of an object involved in a cycle,
but lingering references it has will never be detected. Though, if I
can't guarantee 100% reclamation, I'll have to be anticipating a
buffer dump eventually anyway, which makes, does it not, 90% about the
same as 99%.
furthermore, you're mixing responsibilities for two logically separate
ideas just because a particular implementation happens to associate them,
which is not a good idea from a design pov.

I think a silent omission of finalization is the only alternative. If
so they're mixed, one way or the other. I argue it is closer to
associating your class with a hash table: they are logically separate
ideas. Perhaps implementation is logically separate from design
altogether.
i can remember, way back in the mists of time

I understand they were having a fog problem there yesterday... not to
mention a sale on sand. "Today: Scattered showers and thunderstorms
before 1pm, then a slight chance of showers."
using java finalizers for
doing this kind of thing.  and then learning that it was a bad idea.  once
i got over the initial frustration, it really hasn't been a problem.  i
haven't met a situation

I don't suppose I imagine one. So, you could argue that it's a low
priority. Washing your hands of the rare, though, disqualifies you
from the associate's in philosophy. I bet you want to meet my
customers, too.
where i needed to tie resource management and
memory management together (except for interfacing with c code that does
not use the host language's gc - and i can imagine that for python this is
a very strong (perhaps *the*) argument for reference counting).

I'm using a specialized mapping type to implement the back end of user-
defined classes. Since I know the implementation, which in particular
maps strings to objects, I can always just break cycles by hand; that
is, until someone wants a C extension. Then they will want tp_clear
and tp_traverse methods.
as an bonus example, consider object caching - a very common technique
that immediately breaks anything that associates other resources with
memory use.

I assume your other processes are notified of the cache state. From
what I understand, Windows supports /named/ caching. Collaborators
can check the named cache, in the process creating it if it doesn't
exist, and read and write at will there.
just because, in one limited case, you can do something, doesn't mean it's
a good idea.

andrew

But you're right. I haven't talked this over much on the outside, so
I might be missing something huge, and serialized two-step
finalization (tm) is the secret least of my worries.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,076
Latest member
OrderKetoBeez

Latest Threads

Top