multithreading.


Jerry Coffin

[ ... ]
I think that Steele also wrote a variation on Dijkstra's algorithm. If
you have an ACM account you can probably access it.

Guy L. Steele, Jr., Multiprocessing Compactifying Garbage Collection,
Comm. ACM, Sep. 1975, vol. 18, No. 9, pp. 495-508.

There has been a LOT of work on concurrent collectors since then.

Another paper that may be particularly applicable to this thread is _A
Unified Theory of Garbage Collection_, which deals specifically with not
only the differences but, especially, the similarities between garbage
collection based on reference counting and tracing:

http://www.research.ibm.com/people/d/dfb/papers/Bacon04Unified.pdf

Along with interesting content of its own, this has an _excellent_
bibliography.
 

Gerhard Fiedler

Yes and no. There's basically no such thing as a garbage collector (per
se) that does so.

At the same time, many of the same languages that are typically (or, in
some cases, mandatorily) implemented with garbage collection also often
include things like (again, optional or mandatory) tail recursion
elimination. That tail recursion elimination often makes more garbage
detectable.

Ah, thanks (and also to Jon). That clarifies some things (to me at least :)

Gerhard
 

Stephen J. Bevan

Jon Harrop said:
Let's do a single-threaded benchmark first.


Symbolic rewriting would make a good benchmark. Here is one:

http://www.lambdassociates.org/studies/study10.htm

Try implementing this using reference counting and we can see how it
compares (probably on a trickier rewrite). Replace the arbitrary-precision
ints with doubles.

The (old) machine I read news on is not nearly so powerful as Mark's
so for reference here's the time for a slightly modified version[1] of
the O'Caml code :-

$ ocamlopt -v
The Objective Caml native-code compiler, version 3.08.2
Standard library directory: /usr/lib/ocaml/3.08
$ ocamlopt simplify.ml -o simplify
$ time ./simplify
Took 5.020000s

real 0m5.137s
user 0m5.021s
sys 0m0.005s

Now for C code that uses naive reference counting :-

$ cc -v
Reading specs from /usr/lib/gcc-lib/i486-linux/3.3.5/specs
Configured with: ../src/configure -v --enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared --with-system-zlib --enable-nls --without-included-gettext --enable-__cxa_atexit --enable-clocale=gnu --enable-debug --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc i486-linux
Thread model: posix
gcc version 3.3.5 (Debian 1:3.3.5-8ubuntu2.1)
$ cc -O3 -Wall simplify.c -o simplify
$ time ./simplify
real 0m16.757s
user 0m16.553s
sys 0m0.003s
$

That's over 3x slower than O'Caml so O'Caml/GC is the winner, right?

Well, if your only choice is O'Caml or (naive) reference counting for
this problem then yes. However, there are other approaches. For
example, the following is the time for C code that uses a hand-coded
version of what a sufficiently smart compiler that implemented
linear/unique types could produce :-

$ time ./simplify
real 0m2.812s
user 0m2.751s
sys 0m0.002s

So on my machine naive reference counting is 3x slower than O'Caml GC
but the linear version is 2x faster than O'Caml. Should we conclude
that (O'Caml) GC and naive reference counting both suck?
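To make the "linear" variant concrete: with unique ownership, a rewrite step can reuse one of its operand nodes in place instead of freeing it and allocating a fresh result, and no counts are maintained at all. The following is an illustrative sketch (my own names, not Stephen's actual code) of what a compiler for linear/unique types could emit for the Add(Int n, Int m) -> Int (n+m) rule:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: because every node has exactly one owner, the rewrite can
   reuse the left operand's storage for the result and free the rest
   immediately, with no reference counts touched. */

enum tag { INT, ADD };

typedef struct expr {
    enum tag tag;
    union {
        int n;                              /* INT */
        struct { struct expr *l, *r; } op;  /* ADD */
    } u;
} expr_t;

expr_t *mk_int(int n)
{
    expr_t *e = malloc(sizeof *e);
    e->tag = INT;
    e->u.n = n;
    return e;
}

expr_t *mk_add(expr_t *l, expr_t *r)
{
    expr_t *e = malloc(sizeof *e);
    e->tag = ADD;
    e->u.op.l = l;
    e->u.op.r = r;
    return e;
}

/* Linear simplification of one Add node: consumes its argument. */
expr_t *simplify_add(expr_t *e)
{
    if (e->tag == ADD && e->u.op.l->tag == INT && e->u.op.r->tag == INT) {
        expr_t *l = e->u.op.l, *r = e->u.op.r;
        l->u.n += r->u.n;   /* reuse the left operand for the result */
        free(r);
        free(e);
        return l;
    }
    return e;
}
```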

My conclusion is that while the benchmark clearly measures something
it isn't clear to me that what it measures is interesting/useful. It
may be possible to make it useful in which case the benchmark should
also track peak/average heap size. At present that's pointless
because the test expression is so small that the working set for
each iteration should only be a few hundred bytes. This will fit in
the nursery of any generational collector or only require a copying
collector to traverse a few hundred bytes. That's very GC friendly.

------------------

[1] Compiling the code from the site gave the following :-

$ ocamlopt simplify.ml -o simplify
No implementations provided for the following modules:
Num referenced from simplify.cmx

Rather than work out if I'm suffering from using an out of date
O'Caml or it just isn't installed right, I just changed the code to
use a good old data type and that works fine :-

$ cat simplify.ml
type expr = Int of int | Var of string | Add of expr * expr | Mul of expr * expr;;

let int n = (Int n);;

let ( +: ) f g = Add(f, g) and ( *: ) f g = Mul(f, g);;

let test_expr =
  Var "x" *: ((int 12 *: int 0 +: (int 23 +: int 8)) +: Var "y");;

let time f x =
  let t = Sys.time() in
  let f_x = f x in
  Printf.printf "Took %fs\n" (Sys.time() -. t);
  f_x;;

let rec loop n f x = if n=1 then f x else (ignore(f x); loop (n-1) f x);;

let rec ( +: ) f g = match f, g with
  | Int n, Int m -> Int (n + m)
  | (Int 0), e | e, (Int 0) -> e
  | f, Add(g, h) -> f +: g +: h
  | f, g -> Add(f, g);;

let rec ( *: ) f g = match f, g with
  | Int n, Int m -> Int (n * m)
  | (Int 0), e | e, (Int 0) -> (Int 0)
  | (Int 1), e | e, (Int 1) -> e
  | f, Mul(g, h) -> f *: g *: h
  | f, g -> Mul(f, g);;

let rec simplify = function
  | Int _ | Var _ as f -> f
  | Add (f, g) -> simplify f +: simplify g
  | Mul (f, g) -> simplify f *: simplify g;;

time (loop 10000000 simplify) test_expr;;
 

Stephen J. Bevan

Chris Thomasson said:
[...]

Try the test using the following allocator:

http://groups.google.com/group/comp.lang.c++/browse_frm/thread/68bdc76f9792de1f

What results do you get?

Using malloc(3) directly causes the time for the reference counting C
code to be 30 seconds i.e. 6x slower than O'Caml. So I used a simple
object pooling scheme. Although there are 50000011 allocations, only
14 of them result in a call to malloc(3). The same pool is used in
the linear code so while it may be possible to improve the pool, the
fact that the linear code is 8x quicker than the reference counting
version shows that the (naive) reference counting is the place to look
for performance improvements.
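For reference, the kind of object pooling scheme described above can be sketched as a free list over fixed-size nodes, so that only the first handful of allocations reach malloc(3). This is an illustrative sketch with my own names, not the code used for the timings:

```c
#include <assert.h>
#include <stdlib.h>

/* Fixed-size node pool: freed nodes go on a free list and are reused
   before malloc(3) is called again.  With 50 million allocations but
   a tiny working set, almost every pool_alloc is a list pop. */

typedef struct node {
    struct node *next;   /* free-list link; overlaps payload when free */
    int payload;
} node_t;

static node_t *free_list = NULL;
static size_t malloc_calls = 0;   /* how often we fall through to malloc */

node_t *pool_alloc(void)
{
    if (free_list) {
        node_t *n = free_list;
        free_list = n->next;
        return n;
    }
    malloc_calls++;
    return malloc(sizeof(node_t));
}

void pool_free(node_t *n)
{
    n->next = free_list;
    free_list = n;
}
```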

The above and the fact that your code is C++ and I'm using C means
that there is little incentive to convert your C++ to C to try it.
 

Stephen J. Bevan

Chris Thomasson said:
Here is a crude C version of my single-threaded region allocator:

There were two reasons for not using your C++ allocator, the fact that
it was in C++ was the second reason. Having a C version does nothing
to address the first reason: the linear version is 8x quicker than the
reference counting version using the _same_ allocator. Thus improving
the allocator will not make the reference counting version faster than
the linear version. It _may_ make the reference counting version
faster than O'Caml. However, looking at your C allocator code it does
observably more work per allocation than the pooling allocator I'm
using so it isn't going to be faster. Talk is cheap so I actually
tested it :-

$ time ./simplify speed
(0x804b148/0x804e008/1/8192)::allocator_sys_expand
(0x804b148/0x8052028/2/16384)::allocator_sys_expand
(0x804b148/0x804e008/2/8192)::allocator_sys_promote

real 0m17.530s
user 0m17.307s
sys 0m0.005s
$

This is one second slower than my original code. Not a huge slowdown,
so as code goes it is fine. However, it clearly doesn't make things
faster and so just reinforces my first reason for not using your
C++ code: allocation is not the bottleneck, (naive) reference counting is.
 

Stephen J. Bevan

Chris Thomasson said:
What type of pool allocator are you using? AFAICT, a pool allocator is
analogous to region allocator in that allocation and reclamations can be
as simple as:

Given that allocator performance is not the bottleneck, the method
I used is irrelevant.
The ratio of mallocs to frees does not need to be 1:1. Instead, it
can be N mallocs to 1 reset. The granularity can be improved by clever
partitioning of multiple regions.

Agreed, but unless you have a clever partition for this problem then
that idea is moot.
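The "N mallocs to 1 reset" idea being discussed can be sketched as a bump-pointer region: allocation just advances a pointer through a block, and reset reclaims everything in O(1). This is a minimal sketch of the general technique, not the allocator Chris posted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Bump-pointer region: N allocations, one O(1) reset.  A real
   allocator would chain additional blocks on overflow; this sketch
   simply fails. */

typedef struct {
    unsigned char *base;
    size_t size;
    size_t used;
} region_t;

int region_init(region_t *r, size_t size)
{
    r->base = malloc(size);
    r->size = size;
    r->used = 0;
    return r->base != NULL;
}

void *region_alloc(region_t *r, size_t n)
{
    /* round up to pointer alignment */
    n = (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    if (r->used + n > r->size)
        return NULL;
    void *p = r->base + r->used;
    r->used += n;
    return p;
}

void region_reset(region_t *r)   /* reclaims every object at once */
{
    r->used = 0;
}
```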
Did you free each object individually or did you take advantage of
the allocator_reset procedure? The region allocator I posted does not
perform all that well when you call allocator_release. It's best to
create multiple allocators and partition them into zones, and only
call reset on a zone when it's in a persistent quiescent state.

I used it as a straight malloc&free replacement. There is no point in
trying to take advantage of anything because allocation is not the
bottleneck that is causing the reference counting version to be 8x
slower than a linear version.

I clearly misunderstood you; sorry about that nonsense. I was under the
impression that the allocator was the bottleneck. Humm, can you post the
test code you're using?

I could but it is trivial. If you want to take up Jon's challenge
then you can easily write the code using whatever reference counting
pattern you like. Changing to one form of deferred reference counting
decreases the time to ~11 seconds. Still 2x slower than O'Caml and 5x
slower than a linear version.
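One common form of deferred reference counting is the classic zero-count-table scheme: a decrement that reaches zero only records the object, and a later batch sweep frees those whose count is still zero. A minimal single-threaded sketch of that scheme (my names, not necessarily the variant Stephen timed):

```c
#include <assert.h>
#include <stdlib.h>

/* Deferred RC via a zero-count table (ZCT): rc_release does not free
   immediately; objects that hit zero are queued and reclaimed in
   batches by zct_sweep.  Single-threaded sketch only. */

typedef struct obj {
    int rc;
    /* ... payload ... */
} obj_t;

#define ZCT_MAX 1024
static obj_t *zct[ZCT_MAX];
static int zct_len = 0;
static int freed_count = 0;       /* observable effect, for the sketch */

void rc_acquire(obj_t *o) { o->rc++; }

void rc_release(obj_t *o)
{
    if (--o->rc == 0 && zct_len < ZCT_MAX)
        zct[zct_len++] = o;       /* defer, don't free yet */
}

void zct_sweep(void)              /* run when the table fills, etc. */
{
    for (int i = 0; i < zct_len; i++)
        if (zct[i]->rc == 0) {    /* may have been revived meanwhile */
            free(zct[i]);
            freed_count++;
        }
    zct_len = 0;
}
```

The win is that short-lived zero/non-zero flickering never touches the allocator; the cost is exactly the increased heap occupancy between sweeps that is discussed later in the thread.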

Think of zone-based object reclamation.

How about you think about it? Jon was arguing for GC, you were arguing
for reference counting. All I did was measure the two on Jon's
problem on my machine and show that on this test both suck and
linearity wins, though GC sucks less than two forms of reference
counting.
 

Stephen J. Bevan

Chris Thomasson said:
If you apply optimizations to naive reference counting, well, then it's
no longer all that naive, is it? Conclusion: use the right tool for the
job at hand.

And what might that tool be? It doesn't appear to be GC or reference
counting.
 

Dmitriy V'jukov

And what might that tool be? It doesn't appear to be GC or reference
counting.


Maybe this:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/dab22a17c32c6b13
http://groups.google.ru/group/comp.programming.threads/msg/59c2bc2366831d29


Btw, what do you mean by "reference counting"? Something like this:

void rc_acquire(rc_t* obj)
{
    atomic_inc(&obj->rc);
}

void rc_release(rc_t* obj)
{
    if (0 == atomic_dec(&obj->rc))
        delete obj;
}


Are you testing on SMP/multicore machine?


Dmitriy V'jukov
 

Stephen J. Bevan

Dmitriy V'jukov said:

They may or may not help in an SMP situation, but in a single-threaded
scenario all they do is add more overhead and thus make (naive)
reference counting even slower.

Btw, what you mean by "reference counting"? Something like this:
[snip]

Yes, but since there is no SMP the inc/dec don't have to be atomic.
If they are, the performance is even worse. Though to be fair, I'd
then need to test against an SMP-capable GC too.

Are you testing on SMP/multicore machine?

No.
 

Stephen J. Bevan

Chris Thomasson said:
I don't think it's worth the effort. Oh well. Nobody is paying me here...

Nobody was paying you to argue with Jon that reference counting was
sufficient either.

I was trying to address the various false blanket claims that Jon was
making about every form of reference counting.

Jon challenged you to back up your claim that these claims were false
by providing a reference counting version that is faster than GC for
a specific test. I saw no response from you thus I don't see how you
addressed his claims at all.

Well, don't use refcounting for this problem then... Refcounting is
not the right answer to every problem. :^)

I agree, but then you were the one arguing for reference counting and
against GC not me.
 

Stephen J. Bevan

Chris Thomasson said:
That's not what I was arguing about. You need to re-read the
thread. Jon made misleading claims on reference counting, and I pointed
them out.

Jon's arguments against RC were :-

. Fragility.
. Memory overhead.
. Performance degradation.
. Fragmentation.
. Not incremental, i.e. awful worst case performance.

Whether each one is misleading is a matter of debate. I'd argue that
the first isn't. The last isn't unless you consider at least some
kind of deferred reference counting scheme which just adds to the
first issue, somewhat adds to the second depending on just when the
deferred work is done. The penultimate issue is true when compared to
a copying GC but not say a mark-sweep GC. Regarding performance
degradation for "collection intensive-tasks" you wrote in
<[email protected]> :-

What collection-intensive tasks? I know I can compete, and
most likely beat, a traditional GC with handcrafted implementations
that C++ give me the freedom to work with.

This led Jon to suggest a specific benchmark which is clearly small
enough to do without requiring an onerous amount of work to
prove the point either way. One can certainly question the benchmark
(I certainly do) but as far as I can see you didn't post any kind of
followup at all. That's fine you don't have to, but then it is rather
bold to now claim you were pointing out misleading claims without
specifying which claims were misleading and which were not.

When did I say that reference counting in general is always faster
than GC?

I quoted what you claimed and so you were free to use reference
counting or not as you see fit. You didn't offer a solution.

When did I argue against GC? I was responding to Jon blanket false
claims.

Claiming they are false is not the same as showing they are false.
 

Jerry Coffin

[ ... ]
Jon's arguments against RC were :-

. Fragility.
. Memory overhead.
. Performance degradation.
. Fragmentation.
. Not incremental, i.e. awful worst case performance.

Whether each one is misleading is a matter of debate.

I'd say I disagree, except that disagreeing more or less confirms your
point. Some, however, are clearly false, so they _shouldn't_ be open to
debate, even if they are.
I'd argue that the first isn't.

When stated more specifically (i.e. noting the nature of the fragility)
I'd accept it as quite accurate.
The last isn't unless you consider at least some
kind of deferred reference counting scheme which just adds to the
first issue, somewhat adds to the second depending on just when the
deferred work is done.

More accurately, the last is really false (while still giving a fairly
accurate impression). Worse, your comments are incorrect as well --
deferring collection doesn't add to fragility at all.

Your claim of added memory overhead really IS quite misleading -- it's
easy to limit the extra memory to a single pointer.

Incremental collection is one of the major advantages of reference
counting -- most other forms of garbage collection have substantially
_worse_ worst-case performance, and limiting the worst case adds
considerably more to their complexity as well.
The penultimate issue is true when compared to
a copying GC but not say a mark-sweep GC.

Chris was talking about claims that were technically true, but
misleading anyway. This is the opposite: technically false, but mostly
gives an accurate idea anyway.

Reference counting is concerned only with determining _when_ objects can
be collected. It is entirely orthogonal to what you do with the memory
once you determine that objects are no longer needed.

At the same time, to compact the heap, you need to carry out something
on the order of the mark phase of a mark-sweep collector to find all the
pointers you're going to need to modify when you move an object. As
such, while there's nothing about RC that prevents it, I've never heard
of an RC-based collector that did it (and from a practical viewpoint, I
have a hard time imagining such a thing either).

[ ... ]
Claiming they are false is not the same as showing they are false.

Claiming they're true is not the same as showing they're true either.

I'd score the claims as:

Fragility: True -- when properly qualified
Memory Overhead: generally misleading
Performance degradation: Open to (lots of) debate
Fragmentation: technically false, practically true
Not Incremental: clearly false

Since this is cross-posted to c.l.c++, I feel obliged to point out that
most GC testing is done in languages (e.g. Lisp, Smalltalk, Self) in
which _all_ objects are allocated dynamically, and open to garbage
collection. This is NOT the case in C++; many optimizations in garbage
collectors are based on usage patterns that appear far less likely to
happen in C++. In particular, most GC optimization depends on the fact
that MOST objects die before a collection cycle happens. This is FAR
less likely in C++. While the exact lifetime may not be known, the average
life-span of a dynamically allocated object in C++ is MUCH longer than
in these other languages.
 

Stephen J. Bevan

Jerry Coffin said:
More accurately, the last is really false (while still giving a fairly
accurate impression). Worse, your comments are incorrect as well --
deferring collection doesn't add to fragility at all.

So while you think that reference counting can be considered fragile,
you do not consider that using a deferral mechanism makes it any more
fragile. Interesting. I disagree but I don't think that is
important.

Your claim of added memory overhead really IS quite misleading -- it's
easy to limit the extra memory to a single pointer.

I wrote "[deferred reference counting] somewhat adds to the second
depending on just when the deferred work is done." That has nothing
to do with any extra pointer and everything to do with increased heap
occupancy that occurs due to not freeing memory right away as a naive
reference counting implementation would. Whether it actually has a
higher overhead than any other kind of GC would depend on the other GC
and more than likely the program(s).

At the same time, to compact the heap, you need to carry out something
on the order of the mark phase of a mark-sweep collector to find all the
pointers you're going to need to modify when you move an object. As
such, while there's nothing about RC that prevents it, I've never heard
of an RC-based collector that did it (and from a practical viewpoint, I
have a hard time imagining such a thing either).

One-bit RC implementations are typically coupled with a mark-sweep
collector which may or may not compact.


Claiming they're true is not the same as showing they're true either.

Indeed, and you'll note that I didn't enter this thread debating each
of the individual claims, I only addressed one of them: performance
degradation. I didn't even expect Chris to respond since it was Jon's
challenge to beat GC, and while RC doesn't, a linear version does.
The only reason for listing the claims again was Chris wrote :-
> That's not what I was arguing about. You need to re-read the
> thread. Jon made misleading claims on reference couning, and I
> pointed them out.
> I was responding to Jon blanket false claims.

and so I posted the list to clarify what those claims were and as a
side issue note that they aren't all "blanket false claims" as you
yourself agreed.
 

Stephen J. Bevan

Chris Thomasson said:
Stephen, do you honestly think that Jon's blanket arguments apply to
every type of reference counting algorithm out there? Keep in mind
that he did not know that some state-of-the-art reference counting
algorithms existed.

The only citations I could find are to three techniques posted by
people in comp.programming.threads (one of which is your own). Any RC
mechanism which doesn't put an RC on every object but instead on some
kind of container/proxy for a group of objects is going to reduce the
number of inc/dec calls at the expense of something. There's nothing
particularly state-of-the-art about that idea. Whether the method is
actually applicable depends on the problem. It can be used for Jon's
challenge since there is nothing in the challenge which forces fine
grain sharing between terms. Change the problem slightly (e.g. graph
re-writing) and it is much harder to use a proxy or linearity.

His claims are wrong because he is posting in a C++ group which has no GC:

{
    refcount<Foo> f(new Foo);
    foo(f);
    bar();
}

Jon says that GC will collect f while bar is executing. Well, his
claim is totally misleading because he cannot possible know when the
GC is going to decide to run a cycle. I have been over this with
Jon. He refuses to understand. Oh well.

Since you didn't quote Jon I had a look back through the thread for
that example and I can find it in <[email protected]>
where Jon writes "can collect" not "will collect". To me that makes a
big difference as to whether it is misleading or not.

He even slammed Linux, and insulted Linus Torvalds' intelligence
because he happened to approve a proxy GC over a traditional one...

Whether he slammed Linux or insulted Linus is irrelevant. Just
because I don't agree with him about one issue doesn't mean I
automatically dismiss what he says about another.
 

Stephen J. Bevan

Indeed, but for the most part humans write programs and humans make
mistakes. The important thing is whether the mistakes can easily be
found or whether they can be missed and only show up in production.
I'm reasonably confident that anyone who has used RC in a non-trivial
program (i.e. at least measured in tens of thousands of lines of code,
written by multiple people and actively used by third parties) has
some war stories about spending time trying to track down bugs due to
RC errors. I certainly have, but I have no choice but to use RC in
one or more of its forms. That doesn't mean I'm oblivious to its
problems and wouldn't prefer to use GC if it were feasible (I work in
a memory-constrained environment).

BTW, would you call GC fragile because a programmer can still leak
memory? In certain scenarios, forgetting to set a reference to NULL is
analogous to forgetting to call free.

GC and RC share one failure mode: potential to retain memory that is
semantically free. RC (when used manually and not generated by a
compiler) has the other which is to reclaim too early. Neither mode
is good. But two modes of failure puts RC higher up the fragility
scale than GC.
 

Stephen J. Bevan

Chris Thomasson said:
The most important aspect of state-of-the-art reference counting is
the fact that no dependence on any interlocked RMW and/or
memory-barrier instructions are needed at all. Of course, this means
that various MAJOR benefits in the realm of scalability, throughput
and performance can be realized indeed.

Despite the thread subject, Jon's challenge is single-threaded, thus
RMW, memory barriers, scalability etc. aren't an issue.

Proxy GC works very well within certain problem arenas. For instance,
classic reader-writer problem is efficiently solved through clever use
of the technique.

Proxy GC may be fantastic in its area, but unless you think that Jon's
challenge is in this area and are willing to write the code and time
it then Proxy GC is irrelevant.

Of course a GC "can" collect f while bar is executing. The
misleading part is that he fails to explain that "can" is predicated
on the fact that a GC must run a cycle while bar is executing.

He probably failed to explain it because he assumed it was obvious
that the GC would have to run. I wouldn't call it misleading to
assume that someone knows a basic fact about GC.

A small tweak to the program changes things:

{
    {
        refcount<Foo> f(new Foo);
        foo(f);
    }
    bar();
}


Now, a naive RC WILL collect f _before_ bar has executed. Of course
Jon said that this fact is totally irrelevant.

The reason being that most people don't write code like that from
scratch, they'd only do it once if/when they'd discovered a problem
that forced them to do it. Jon is selling the idea that if you use GC
then you'd never have to go through the problem discovery stage, the
GC would take care of it for you. Whether anyone is buying depends on
how often they see this kind of problem. It isn't high up on my list.

AFAICT, his insults against Linux and/or Linus are based in a form of
ignorance. Much like his attribution of several problems to _all_
forms of RC. IMVHO, its relevant indeed.

To you, but not to me.
 

Jerry Coffin

[ ... ]
GC and RC share one failure mode: potential to retain memory that is
semantically free. RC (when used manually and not generated by a
compiler) has the other which is to reclaim too early. Neither mode
is good. But two modes of failure puts RC higher up the fragility
scale than GC.

How would RC free memory that's still in use, short of a bug in the RC
code itself?
 

Stephen J. Bevan

Jerry Coffin said:
[ ... ]
GC and RC share one failure mode: potential to retain memory that is
semantically free. RC (when used manually and not generated by a
compiler) has the other which is to reclaim too early. Neither mode
is good. But two modes of failure puts RC higher up the fragility
scale than GC.

How would RC free memory that's still in use, short of a bug in the RC
code itself?

This isn't an RC bug per se; rather it is a bug inherent to all
manual memory (de)allocation methods: human error in writing the
program such that in some context an extra put/release is called which
causes memory to be freed while another part of the program still
thinks it has a (final) reference to the object and so can access it
safely.
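A minimal illustration of that failure mode: one extra release drops the count to zero early, so a pointer still held elsewhere now dangles. The counter and flag names here are mine, for illustration (a real implementation would free the object where the flag is set):

```c
#include <assert.h>
#include <stdlib.h>

/* The manual-RC failure mode described above: an erroneous extra
   release reaches zero while another part of the program still holds
   what it believes is a live reference.  'freed' stands in for the
   actual free(o) so the dangling state can be observed safely. */

typedef struct {
    int rc;
    int data;
} obj_t;

static int freed = 0;

obj_t *obj_new(void)
{
    obj_t *o = calloc(1, sizeof *o);
    o->rc = 1;
    return o;
}

void obj_acquire(obj_t *o) { o->rc++; }

void obj_release(obj_t *o)
{
    if (--o->rc == 0)
        freed = 1;   /* real code: free(o); 'o' now dangles elsewhere */
}
```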
 
