Queue cleanup

E

EW

Hi

I'm writing a multithreaded app that relies on Queues to move data
between the threads. I'm trying to write my objects in a general way
so that I can reuse them in the future, which means I can't know in
advance how many producer and how many consumer threads I might need.
I also might have different consumer threads doing different tasks
(for example one might write to a log and one might write to SQL), so
again I can't plan for a set ratio of consumers to producers. It's
unknown.

So this means that instead of having 1 Queue that all the producers
put to and that all the consumers get from, I actually have 1 Queue
per producer thread, which the main body routes to the correct type of
consumer thread. So I could get something like this, where 3 producer
threads write to 3 different Queues, all of which get read by 1
consumer thread:

P1   P2   P3
  \   |   /
   \  |  /
     C1

So producers 1, 2, and 3 all write to individual Queues and consumer 1
has a list of those Queues and reads them all. The problem I'm having
is that those producer threads can come and go pretty quickly, and when
they die I can clean up the thread with join() but I'm still left with
the Queue. So I could get something like this:

P1        P3
  \   |   /
   \  |  /
     C1

So here the P2 thread has ended and gone away but I still have his
Queue lingering.

So on a thread I can use is_alive() to check status and use join() to
clean up, but I don't see any analogous functionality for Queues. How
do I kill them? I thought about putting a suicide message on the
Queue and then C1 would read it and set the variable to None, but I'm
not sure setting the variable to None actually makes the Queue go
away. It could just end up sitting in memory unreferenced - and
that's not good. Additionally, I could have any number of consumer
threads reading that Queue, so once the first one gets the suicide note
the other consumer threads never would.

I figure there has to be an elegant way for managing my Queues but so
far I can't find it. Any suggestions would be appreciated and thanks
in advance for any help.


ps Python rocks.
 
E

EW

Hi

[snip: original message quoted in full]

Whoo..the formatting got torn up! My terrible diagrams are even more
terrible! Oh well, I think you'll catch my meaning :)
 
P

Paul Rubin

EW said:
I also might have different consumer threads do
different tasks (for example one might write to a log and one might
write to SQL) so that again means I can't plan for a set ratio of
consumers to producers.... So it's unknown.

So this means that instead of having 1 Queue that all the producers
put to and that all the consumers get from I actually have 1 Queue per
producer thread

That doesn't sound appropriate. Queues can have many readers and many
writers. So use one queue per task (logging, SQL, etc), regardless of
the number of producer or consumer threads. Any producer with an SQL
request sends it to the SQL queue, which can have many listeners. The
different SQL consumer threads listen to the SQL queue and pick up
requests and handle them.
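A minimal sketch of that arrangement, with any number of workers blocking on a single shared queue (handle_sql_request, the sentinel convention, and the worker count are illustrative, not from the post):

```python
import threading
from queue import Queue  # the "Queue" module on Python 2

sql_queue = Queue()  # one shared queue for the SQL task type

def handle_sql_request(request):
    # placeholder for the real SQL work
    print("SQL:", request)

def sql_worker(q):
    # Any number of consumers can block on get() from the same queue.
    while True:
        request = q.get()
        if request is None:   # shutdown sentinel
            q.put(None)       # pass it along so the other workers see it
            break
        handle_sql_request(request)

# Start as many SQL consumers as you like; they all read the same queue.
workers = [threading.Thread(target=sql_worker, args=(sql_queue,))
           for _ in range(3)]
for w in workers:
    w.start()

# Any producer with an SQL request just does: sql_queue.put(some_request)
```

Producers never need to know how many consumers exist, and vice versa; the queue is the only shared object.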
 
E

EW

That doesn't sound appropriate.  Queues can have many readers and many
writers.  So use one queue per task (logging, SQL, etc), regardless of
the number of producer or consumer threads.  Any producer with an SQL
request sends it to the SQL queue, which can have many listeners.  The
different SQL consumer threads listen to the SQL queue and pick up
requests and handle them.

I thought about doing it that way and I could do it that way but it
still seems like there should be a way to clean up Queues on my own.
If I did it this way then I guess I'd be relying on garbage collection
when the script ended to clean up the Queues for me.

What if I want to clean up my own Queues? Regardless of the specifics
of my current design, I'm just generally curious how people manage
cleanup of their Queues when they don't want them any more.
 
M

MRAB

EW wrote:
[snip]
So here the P2 thread has ended and gone away but I still have his
Queue lingering.

So on a thread I can use is_alive() to check status and use join() to
clean up but I don't see any analogous functionality for Queues. How
do I kill them? I thought about putting a suicide message on the
Queue and then C1 would read it and set the variable to None but I'm
not sure setting the variable to None actually makes the Queue go
away. It could just end up sitting in memory unreferenced - and
that's not good. Additionally, I could have any number of consumer
threads reading that Queue so once the first one gets the suicide note
the other consumer threads never would.

I figure there has to be an elegant way for managing my Queues but so
far I can't find it. Any suggestions would be appreciated and thanks
in advance for any help.
An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.

As for the suicide note, if a consumer sees it then it can put it back
into the queue so other consumers will see it and then forget about the
queue (set the variable which refers to the queue to None, or, if the
references are in a list, delete it from the list).
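That pass-it-along idea might look something like this (the DONE sentinel and the polling loop are illustrative; a real consumer would sleep briefly between empty passes instead of spinning):

```python
from queue import Queue, Empty

DONE = object()  # the "suicide note" sentinel

def drain(queues):
    # A consumer owning a list of per-producer queues.  On seeing the
    # sentinel it puts it back for any other consumers, then deletes
    # the queue from its own list, so the Queue becomes garbage once
    # every consumer has forgotten it.
    items = []
    while queues:
        for q in list(queues):
            try:
                item = q.get_nowait()
            except Empty:
                continue
            if item is DONE:
                q.put(DONE)        # other consumers still get the note
                queues.remove(q)   # forget this queue
            else:
                items.append(item)
    return items
```

Once the last consumer drops its reference, nothing refers to the Queue and it is reclaimed.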
 
E

EW

EW wrote:

[snip]


So here the P2 thread has ended and gone away but I still have his
Queue lingering.

So on a thread I can use is_alive() to check status and use join() to
clean up but I don't see any analogous functionality for Queues.  How
do I kill them?  I thought about putting a suicide message on the
Queue and then C1 would read it and set the variable to None but I'm
not sure setting the variable to None actually makes the Queue go
away.  It could just end up sitting in memory unreferenced - and
that's not good.  Additionally, I could have any number of consumer
threads reading that Queue so once the first one gets the suicide note
the other consumer threads never would.

I figure there has to be an elegant way for managing my Queues but so
far I can't find it.  Any suggestions would be appreciated and thanks
in advance for any help.

An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.

As for the suicide note, if a consumer sees it then it can put it back
into the queue so other consumers will see it and then forget about the
queue (set the variable which refers to the queue to None, or, if the
references are in a list, delete it from the list).

Ok great. I wasn't sure about the Garbage collection part of it.
That's actually pretty easy.

Thanks!
 
P

Paul Rubin

EW said:
I thought about doing it that way and I could do it that way but it
still seems like there should be a way to clean up Queues on my own.
If I did it this way then I guess I'd be relying on garbage collection
when the script ended to clean up the Queues for me.

Oh, I see. As long as it's possible to start new producer or consumer
threads that touch a queue, obviously that queue has to still be around.
If the program starts all its threads at the beginning, then runs til
they exit, then does more stuff, then you could do something like:

# make a dictionary of queues, one queue per task type
queues = {'sql': Queue(), 'logging': Queue()}  # ..., one per task type

for i in range(nthreads):  # however many threads you want
    threading.Thread(target=your_handler, args=[queues]).start()

del queues

and then when all the threads exit, there are no remaining references to
the queues. But why do you care? Queues aren't gigantic structures,
they're just a list (collections.deque) with an rlock. It's fine to let
the gc clean them up; that's the whole point of having a gc in the first
place.
 
E

EW

EW said:
I thought about doing it that way and I could do it that way but it
still seems like there should be a way to clean up Queues on my own.
If I did it this way then I guess I'd be relying on garbage collection
when the script ended to clean up the Queues for me.

Oh, I see.  As long as it's possible to start new producer or consumer
threads that touch a queue, obviously that queue has to still be around.
If the program starts all its threads at the beginning, then runs til
they exit, then does more stuff, then you could do something like:

    # make a dictionary of queues, one queue per task type
    queues = {'sql': Queue(), 'logging': Queue()}  # ..., one per task type

    for i in range(nthreads):  # however many threads you want
        threading.Thread(target=your_handler, args=[queues]).start()

    del queues

and then when all the threads exit, there are no remaining references to
the queues.  But why do you care?  Queues aren't gigantic structures,
they're just a list (collections.deque) with an rlock.  It's fine to let
the gc clean them up; that's the whole point of having a gc in the first
place.

Well I cared because I thought garbage collection would only happen
when the script ended - the entire script. Since I plan on running
this as a service it'll run for months at a time without ending. So I
thought I was going to have heaps of Queues hanging out in memory,
unreferenced and unloved. It seemed like bad practice so I wanted to
get out ahead of it.

But the GC doesn't work the way I thought it worked so there's really
no problem I guess. I was just confused on garbage collection it seems.
 
P

Paul Rubin

EW said:
Well I cared because I thought garbage collection would only happen
when the script ended - the entire script. Since I plan on running
this as a service it'll run for months at a time without ending. So I
thought I was going to have heaps of Queues hanging out in memory,
unreferenced and unloved. It seemed like bad practice so I wanted to
get out ahead of it.

Even if GC worked that way it wouldn't matter, if you use just one queue
per type of task. That number should be a small constant so the memory
consumption is small.
 
M

MRAB

Paul said:
Even if GC worked that way it wouldn't matter, if you use just one queue
per type of task. That number should be a small constant so the memory
consumption is small.

That's basically how _non_-garbage-collected languages work! :)
 
E

EW

Even if GC worked that way it wouldn't matter, if you use just one queue
per type of task.  That number should be a small constant so the memory
consumption is small.

Well I can't really explain it but 1 Queue per task for what I'm
designing just doesn't feel right to me. It feels like it will lack
future flexibility. I like having 1 Queue per producer thread object
and the person instantiating that object can do whatever he wants with
that Queue. I can't prove I'll need that level of flexibility but I
don't see why it's bad to have. It's still a small number of Queues,
it's just a small, variable, number of Queues.
 
P

Paul Rubin

EW said:
Well I can't really explain it but 1 Queue per task for what I'm
designing just doesn't feel right to me. It feels like it will lack
future flexibility.

That makes no sense at all. Multiple readers and writers per queue are
the way Python queues are designed to work. The normal way to spray a
bunch of concurrent tasks to worker threads is just have a bunch of
workers listening to one queue. It's the same way at the producer end.
 
A

Aahz

An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.

This isn't actually garbage collection as most people think of it.
Refcounting semantics mean that objects get reaped as soon as nothing
points at them. OTOH, CPython does also have garbage collection to back
up refcounting so that when you have unreferenced object cycles they
don't stay around.
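The distinction is easy to demonstrate in CPython with weakref and the gc module:

```python
import gc
import weakref

class Node:
    pass

# A plain object is reclaimed the moment its last reference goes away.
n = Node()
r = weakref.ref(n)
del n
print(r() is None)   # True: refcounting freed it immediately

# A reference cycle keeps refcounts above zero, so only the cyclic
# garbage collector (gc.collect) can reap it.
a, b = Node(), Node()
a.other, b.other = b, a
ra = weakref.ref(a)
del a, b
print(ra() is None)  # False: the cycle is still alive
gc.collect()         # the tracing collector breaks the cycle
print(ra() is None)  # True: now it's gone
```

This behaviour is specific to CPython; implementations without refcounting (Jython, IronPython, PyPy) may delay the first reclamation too.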
 
J

John Nagle

Well I can't really explain it but 1 Queue per task for what I'm
designing just doesn't feel right to me. It feels like it will lack
future flexibility. I like having 1 Queue per producer thread object
and the person instantiating that object can do whatever he wants with
that Queue. I can't prove I'll need that level of flexibility but I
don't see why it's bad to have. It's still a small number of Queues,
it's just a small, variable, number of Queues.

That's backwards. Usually, you want one queue per unique consumer.
That is, if you have a queue that contains one kind of request,
there's one thread reading the queue, blocked until some other
thread puts something on the queue. No polling is needed.

One consumer reading multiple queues is difficult to implement
well.

Note, by the way, that CPython isn't really concurrent. Only
one thread runs at a time, due to an archaic implementation. So
if your threads are compute-bound, even on a multicore CPU threading
will not help.

There's a "multiprocessing module" which allows spreading work
over several processes instead of threads. That can be helpful
as a workaround.

John Nagle
 
S

Steven D'Aprano

This isn't actually garbage collection as most people think of it.
Refcounting semantics mean that objects get reaped as soon as nothing
points at them. OTOH, CPython does also have garbage collection to back
up refcounting so that when you have unreferenced object cycles they
don't stay around.


I've repeatedly asked, both here and elsewhere, why reference counting
isn't "real" garbage collection. Nobody has been able to give me a
satisfactory answer. As far as I can tell, it's a bit of pretentiousness
with no basis in objective fact.

http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
http://en.wikipedia.org/wiki/Reference_counting

Reference counting is one specific kind of garbage collection, and like
all gc strategies, it has strengths as well as weaknesses. It is *simple*
to implement (which may be why a certain class of programmer likes to
think it isn't "real" gc). When it runs is deterministic, and
reclamation is immediate once the last reference is dropped. The
overhead is very light (a
plus) and continuous (which can be both a plus and a minus). It is better
suited to managing scarce resources like open files than are tracing
garbage collectors. It avoids the "embarrassing pause" of tracing
collectors. It doesn't deal well with reference cycles, and (at least
with Python's implementation of ref counting) it causes performance
issues with threaded applications.
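For instance, the deterministic-cleanup property looks like this under CPython (the Resource class is a made-up stand-in for something scarce, like an open file):

```python
log = []

class Resource:
    # stands in for a scarce resource such as an open file
    def __init__(self, name):
        self.name = name
    def __del__(self):
        # with refcounting this runs the moment the last
        # reference disappears -- no waiting for a collector
        log.append("closed " + self.name)

r = Resource("db-connection")
del r
print(log)  # ['closed db-connection'] immediately under CPython
```

Under a tracing collector the same finalizer might not run until the next collection cycle, whenever that happens to be.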

So CPython has two garbage collectors -- a reference counting
implementation, and a tracing implementation. Jython and IronPython use
the native garbage collectors from Java and .Net. Other Pythons may use
something else.
 
P

Paul Rubin

Steven D'Aprano said:
I've repeatedly asked, both here and elsewhere, why reference counting
isn't "real" garbage collection. Nobody has been able to give me a
satisfactory answer. As far as I can tell, it's a bit of pretentiousness
with no basis in objective fact.

Well, it's a bit of a subjective matter. I'd say it's not real gc
because 1) it's unsound (misses reference cycles), and 2) it requires
constant attention from the mutator to incr and decr the reference
counts. So developing modules for the CPython API means endlessly
finding and fixing refcount bugs that lead to either crashes/security
failures, or memory leaks. If you program the Java JNI or a typical
Lisp FFI, you'll find that real gc is a lot simpler to use since you
avoid all the refcount maintenance hassles. You allocate memory and
shut your eyes, and the gc takes care of freeing it when it figures out
that you are done. Refcounting is basically a form of manual memory
management, while gc is automatic.

Someone said here recently that as a program gets larger, saying "this
will work as long as we do X every time without fail" becomes equal to
saying "this won't work". Substitute "properly maintain all ref counts"
for X and you can see the problem. I've seen released "production"
"tested" Python C modules with subtle refcount bugs on more than one
occasion. In gc'd systems there are fewer places for the code to go
wrong.
 
D

Dennis Lee Bieber

I've repeatedly asked, both here and elsewhere, why reference counting
isn't "real" garbage collection. Nobody has been able to give me a
satisfactory answer. As far as I can tell, it's a bit of pretentiousness
with no basis in objective fact.

http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
http://en.wikipedia.org/wiki/Reference_counting

Reference counting is one specific kind of garbage collection, and like
all gc strategies, it has strengths as well as weaknesses. It is *simple*

The nice thing about it is that it is sort of deterministic -- one
can examine code and determine that an object is collected at some point
in the execution...

Heap marking, OTOH, tends to run at indeterminate times, which could
have an impact if one needs predictable response timings (I still recall
the Supersoft LISP that ran on my TRS-80 -- the first time per session
that it ran the garbage collector could take minutes, as it basically
went through and marked all unused memory in the system as "available"
in order to add it to the allocator...)
 
P

Paul Rubin

Dennis Lee Bieber said:
The nice thing about it [reference counting] is that it is sort
of deterministic -- one can examine code and determine that an object
is collected at some point in the execution...
Heap marking, OTOH, tends to run at indeterminate times, which could
have an impact if one needs predictable response timings

Reference counting has the same problem. If you drop the last reference
to a complex structure, it could take quite a long time to free all the
components. By contrast there are provably real-time tracing gc
schemes, including some parallelizeable ones. One reason CPython still
can't run threads on parallel cores is it would have to lock the
reference counts every time they're updated, and the slowdown from that
is terrible.
 
S

Steven D'Aprano

Dennis Lee Bieber said:
The nice thing about it [reference counting] is that it is sort
of deterministic -- one can examine code and determine that an object
is collected at some point in the execution...
Heap marking, OTOH, tends to run at indeterminate times, which could
have an impact if one needs predictable response timings

Reference counting has the same problem.

In theory, yes, but in practice ref counting tends to spread out the
performance impact more smoothly. There are exceptions, such as the one
you mention below, but as a general rule ref counting isn't subject to
the "embarrassing pauses" that tracing garbage collectors tend to be
subject to.

If you drop the last reference
to a complex structure, it could take quite a long time to free all the
components. By contrast there are provably real-time tracing gc
schemes, including some parallelizeable ones.

I could be wrong, but how can they not be subject to the same performance
issue? If you have twenty thousand components that all have to be freed,
they all have to be freed whether you do it when the last reference is
cleared, or six seconds later when the gc does a sweep.

One reason CPython still
can't run threads on parallel cores is it would have to lock the
reference counts every time they're updated, and the slowdown from that
is terrible.

On the other hand, the reason that CPython still has reference counting
is that the alternatives tried so far have been unacceptably slow for
non-threaded code.
 
S

Steven D'Aprano

Well, it's a bit of a subjective matter. I'd say it's not real gc
because 1) it's unsound (misses reference cycles),

You can add cycle detection to a reference count gc, at the cost of more
complexity.

If you read the Wikipedia article I linked to, tracing algorithms can
also be unsound:

Some collectors running in a particular environment can
correctly identify all pointers (references) in an object;
these are called "precise" (also "exact" or "accurate")
collectors, the opposite being a "conservative" or "partly
conservative" collector. Conservative collectors have to
assume that any bit pattern in memory could be a pointer if
(when interpreted as a pointer) it would point into any
allocated object. Thus, conservative collectors may have
some false negatives, where storage is not released because
of accidental fake pointers...

http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)


and 2) it requires
constant attention from the mutator to incr and decr the reference
counts.

Yes. And?

So developing modules for the CPython API means endlessly
finding and fixing refcount bugs that lead to either crashes/security
failures, or memory leaks. If you program the Java JNI or a typical
Lisp FFI, you'll find that real gc is a lot simpler to use since you
avoid all the refcount maintenance hassles. You allocate memory and
shut your eyes, and the gc takes care of freeing it when it figures out
that you are done. Refcounting is basically a form of manual memory
management, while gc is automatic.


That's a problem with the CPython API, not reference counting. The
problem is that the CPython API is written at too low a level, beneath
that at which the garbage collector exists, so naturally you have to
manually manage memory.

Someone said here recently that as a program gets larger, saying "this
will work as long as we do X every time without fail" becomes equal to
saying "this won't work". Substitute "properly maintain all ref counts"
for X and you can see the problem. I've seen released "production"
"tested" Python C modules with subtle refcount bugs on more than one
occasion. In gc'd systems there are fewer places for the code to go
wrong.

On the other hand, tracing gcs have their own set of problems too, mostly
due to the use of finalizers and attempts to make garbage collection run
more predictably. See here:

http://publib.boulder.ibm.com/infoc...doc.diagnostics.142j9/html/coexistwithgc.html

Quote:

For tidying Java resources, think about the use of a clean
up routine. When you have finished with an object, call the
routine to null out all references, deregister listeners,
clear out hash tables, and so on. This is far more efficient
than using a finalizer and has the useful side-benefit of
speeding up garbage collection. The Garbage Collector does
not have so many object references to chase in the next
garbage collection cycle.


Translated: "Rather than relying on the garbage collector to clean up
resources after you, do it yourself, manually, so the garbage collector
has less work to do."

Tracing garbage collectors aren't a panacea. They're software themselves,
and complex software, which means they're subject to bugs like the one
which plagued Flash plugin 9:

http://gskinner.com/blog/archives/2008/04/failure_to_unlo.html

The more complicated the garbage collector, the more scope you have for
some interaction between your high-level code and the gc leading to
memory not be reclaimed or extreme slowdown. Like this:

http://tech.puredanger.com/2009/02/11/linkedblockingqueue-garbagecollection/
 
