An accurate, pauseless & deterministic GC for C++


Arnold Hendriks

Juha Nieminen said:
That may be true in Europe but not in the US.
In Europe, software patents cannot be enforced at all. (Although they can be
registered, to keep the EPO busy)
 

Chris Thomasson

Mingnan G. said:
By downloading the source code, you will find in HNXGC_MT directory
which uses the Global Memory Fence technology to elide acquire/
release
semantics in a multi-processors environment, such as for Windows 64-
bit
IA64 platform. The binary code for Windows 64-bit IA64 is also
available.

Please inform us when you have filled in the following page:

http://hnxgc.harnixworld.com/algo_gmf.htm



Hmm, well, I think I will go ahead and sign up and download the source code
sometime today.
 

Chris Thomasson

Mingnan G. said:
By downloading the source code, you will find in HNXGC_MT directory
which uses the Global Memory Fence technology to elide acquire/
release
semantics in a multi-processors environment, such as for Windows 64-
bit
IA64 platform. The binary code for Windows 64-bit IA64 is also
available.

[...]

I can't seem to find the sign-up page. I sent you an e-mail requesting
further instruction.

Thanks.
 

Chris Thomasson

Chris Thomasson said:
Mingnan G. said:
By downloading the source code, you will find in HNXGC_MT directory
which uses the Global Memory Fence technology to elide acquire/
release
semantics in a multi-processors environment, such as for Windows 64-
bit
IA64 platform. The binary code for Windows 64-bit IA64 is also
available.

[...]

I can't seem to find the sign-up page. I sent you an e-mail requesting
further instruction.

STUPID ME!!!!

I found it, and have downloaded your source code. Anyway, it seems like you're
using some form of reference counting (I need to examine the code to find your
exact algorithm) and using a tracing GC to handle all of the circular
references.
 

Chris Thomasson

Joe Seigh said:
(sorry for the repost but it probably would help if I actually included
c.p.t) [...]
CC'ing c.p.t

I don't know what the GlobalMemoryFence is yet since that part is not
filled in yet. Generally with shared pointers you need acquire/release
semantics which can't be elided.
[...]

I took a quick look at the code. AFAICT, he creates a thread per-processor
and binds it to that processor.


So if you're running on a 64-processor system, he creates:

Thread 1 - Bound To CPU 1
Thread 2 - Bound To CPU 2
Thread 3 - Bound To CPU 3
Thread 4 - Bound To CPU 4
Thread 5 - Bound To CPU 5
Thread 6 - Bound To CPU 6

[and on and on to 64]

Those threads basically sit on an event and wait. When they are awoken, they
execute a membar and atomically decrement a counter and check to see if it
dropped to zero, which means that every "CPU" thread has executed it.
AFAICT, this is exactly the same as the synchronize_rcu() function
implemented with a daemon per-cpu. So, there is really nothing new here at
all. I think he is going to have a problem with prior art for sure.

As for the smart pointer reference counting, I haven't looked yet as I was
mainly interested in GlobalMemoryFence (a.k.a. synchronize_rcu()).
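
A rough sketch of that daemon-per-CPU scheme, written in portable modern C++
with hypothetical names (this is not the HnxGC source; binding each thread to
its CPU is platform-specific and omitted here):

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class global_memory_fence {
    std::mutex m_;
    std::condition_variable start_, done_;
    std::vector<std::thread> daemons_;
    std::atomic<int> pending_{0};
    unsigned generation_ = 0;
    bool stop_ = false;

    void daemon() {
        unsigned seen = 0;
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            start_.wait(lk, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            // the "membar" each bound thread executes when awoken
            std::atomic_thread_fence(std::memory_order_seq_cst);
            if (pending_.fetch_sub(1, std::memory_order_acq_rel) == 1)
                done_.notify_all();
        }
    }

public:
    explicit global_memory_fence(unsigned ncpus) {
        for (unsigned i = 0; i < ncpus; ++i)
            daemons_.emplace_back(&global_memory_fence::daemon, this);
    }

    ~global_memory_fence() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        start_.notify_all();
        for (std::thread& t : daemons_) t.join();
    }

    // returns only after every daemon has executed a full memory barrier
    void synchronize() {
        std::unique_lock<std::mutex> lk(m_);
        pending_.store(static_cast<int>(daemons_.size()), std::memory_order_relaxed);
        ++generation_;
        start_.notify_all();
        done_.wait(lk, [&] { return pending_.load(std::memory_order_acquire) == 0; });
    }
};

synchronize() here plays the same role described above for synchronize_rcu()
implemented with a daemon per CPU: it blocks until every "CPU" thread has
executed a full barrier and decremented the counter to zero.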


Any thoughts?
 

Mingnan G.

(sorry for the repost but it probably would help if I actually included
c.p.t) [...]
CC'ing c.p.t
I don't know what the GlobalMemoryFence is yet since that part is not
filled in yet. Generally with shared pointers you need acquire/release
semantics which can't be elided.

[...]

I took a quick look at the code. AFAICT, he creates a thread per-processor
and binds it to that processor.

So if you're running on a 64-processor system, he creates:

Thread 1 - Bound To CPU 1
Thread 2 - Bound To CPU 2
Thread 3 - Bound To CPU 3
Thread 4 - Bound To CPU 4
Thread 5 - Bound To CPU 5
Thread 6 - Bound To CPU 6

[and on and on to 64]

Those threads basically sit on an event and wait. When they are awoken, they
execute a membar and atomically decrement a counter and check to see if it
dropped to zero, which means that every "CPU" thread has executed it.
AFAICT, this is exactly the same as the synchronize_rcu() function
implemented with a daemon per-cpu. So, there is really nothing new here at
all. I think he is going to have a problem with prior art for sure.

As for the smart pointer reference counting, I haven't looked yet as I was
mainly interested in GlobalMemoryFence (a.k.a, synchronize_rcu()).

Any thoughts?

synchronize_rcu is for locking (lock-free); GlobalMemoryFence is for memory
semantics on multi-processors. They are two totally different things.
 

Joe Seigh

Chris said:
Joe Seigh said:
(sorry for the repost but it probably would help if I actually
included c.p.t) [...]
CC'ing c.p.t

I don't know what the GlobalMemoryFence is yet since that part is not
filled in yet. Generally with shared pointers you need acquire/release
semantics which can't be elided.
[...]

I took a quick look at the code. AFAICT, he creates a thread
per-processor and binds it to that processor.


So if you're running on a 64-processor system, he creates:

Thread 1 - Bound To CPU 1
Thread 2 - Bound To CPU 2
Thread 3 - Bound To CPU 3
Thread 4 - Bound To CPU 4
Thread 5 - Bound To CPU 5
Thread 6 - Bound To CPU 6

[and on and on to 64]

Those threads basically sit on an event and wait. When they are awoken,
they execute a membar and atomically decrement a counter and check to
see if it dropped to zero, which means that every "CPU" thread has
executed it. AFAICT, this is exactly the same as the synchronize_rcu()
function implemented with a daemon per-cpu. So, there is really nothing
new here at all. I think he is going to have a problem with prior art
for sure.

As for the smart pointer reference counting, I haven't looked yet as I
was mainly interested in GlobalMemoryFence (a.k.a, synchronize_rcu()).

I figured it was some form of RCU w/ memory barriers in the quiescent states.
That by itself doesn't let you elide memory barriers. Whether you can elide
a membar depends on the specific situation. And the proofs can be a bit
complicated depending on the situation (at least for SMR+RCU).

I'm guessing from the docs that the refcounting is a bit like atomic_ptr, with
its distinction between local non-shared and global heap-shared reference counts.

With refcounting, it was always safe to use a raw reference as long as you had
(owned) one refcount. There seems to be a wrapper class that is the logical
equivalent.
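
A minimal sketch of that kind of wrapper, assuming nothing about the actual
HnxGC classes and using std::shared_ptr in place of its counted pointer
(the name pinned<> is hypothetical): holding one owned refcount makes it safe
to hand out raw pointers for the wrapper's lifetime.

#include <memory>

template <class T>
class pinned {
    std::shared_ptr<T> owner_;   // the one owned refcount keeps T alive
public:
    explicit pinned(std::shared_ptr<T> p) : owner_(std::move(p)) {}
    // raw access is safe for as long as this wrapper is in scope
    T* get() const { return owner_.get(); }
    T* operator->() const { return owner_.get(); }
};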

The burden of proof is on him. He needs to show how his stuff is different
from prior art. That means explicitly listing the prior art and contrasting
it with his stuff.
 

Chris Thomasson

Mingnan G. said:
(sorry for the repost but it probably would help if I actually included
c.p.t) [...]
CC'ing c.p.t
I don't know what the GlobalMemoryFence is yet since that part is not
filled in yet. Generally with shared pointers you need acquire/release
semantics which can't be elided.

[...]

I took a quick look at the code. AFAICT, he creates a thread per-processor
and binds it to that processor.

So if you're running on a 64-processor system, he creates:

Thread 1 - Bound To CPU 1
Thread 2 - Bound To CPU 2
Thread 3 - Bound To CPU 3
Thread 4 - Bound To CPU 4
Thread 5 - Bound To CPU 5
Thread 6 - Bound To CPU 6

[and on and on to 64]

Those threads basically sit on an event and wait. When they are awoken, they
execute a membar and atomically decrement a counter and check to see if it
dropped to zero, which means that every "CPU" thread has executed it.
AFAICT, this is exactly the same as the synchronize_rcu() function
implemented with a daemon per-cpu. So, there is really nothing new here at
all. I think he is going to have a problem with prior art for sure.

As for the smart pointer reference counting, I haven't looked yet as I was
mainly interested in GlobalMemoryFence (a.k.a. synchronize_rcu()).

Any thoughts?

synchronize_rcu is for locking (lock-free);

synchronize_rcu() produces the exact same overall effect as
GlobalMemoryFence. When the daemons run, one on each CPU, it constitutes a
global memory barrier. The well-known technique is not tied to just the
writer side of a non-blocking reader pattern. You can use it for other
things. For instance, I could implement the reference counting algorithm you
use with synchronize_rcu(), no problem.
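
Roughly the kind of construction meant here, with hypothetical names: relaxed
reference count mutations on the fast path, and one global-fence call standing
in for the per-operation acquire/release just before reclamation. Whether the
relaxed fast path really suffices is, of course, exactly the point under
discussion in this thread.

#include <atomic>

extern void global_fence_synchronize();   // e.g. the daemon-per-CPU scheme sketched earlier

struct object {
    std::atomic<long> refs{1};
    // ... payload ...
};

inline void acquire(object* o) {
    o->refs.fetch_add(1, std::memory_order_relaxed);   // no acquire barrier here
}

inline void release(object* o) {
    if (o->refs.fetch_sub(1, std::memory_order_relaxed) == 1) {   // no release barrier here
        global_fence_synchronize();   // every CPU executes a full membar once
        delete o;                     // prior accesses by other threads are now visible
    }
}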



GlobalMemoryFence is for memory semantics on multi-processors.
They are two totally different things.

I believe that they do the same thing. BTW, have you looked at Joe Seigh's
prior art? He has a more scalable method of extracting "global" quiescent
states:

http://atomic-ptr-plus.sourceforge.net


I had done some in-house work on this in an experimental setting, so luckily
I knew how to sketch out a solution to Joe Seigh's challenge to show a
Windows implementation of a "global" memory fence:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5d4e81b46352905d


I say that those are more scalable because you don't have to create a
thread per-CPU and issue a global broadcast. I don't think that would go
over well with a distributed NUMA system with a shitload of processors. I
could make your system work on a NUMA system; however, you would have to
change your API to allow for thread grouping.
 

Chris Thomasson

Joe Seigh said:
Chris said:
[...]

I figured it was some form of RCU w/ memory barriers in the quiescent states.
That by itself doesn't let you elide memory barriers. Whether you can elide
a membar depends on the specific situation. And the proofs can be a bit
complicated depending on the situation (at least for SMR+RCU).

Yeah. For instance, the polling logic needs to do some things (e.g., add an
extra epoch) to allow you to have release semantics. If you're doing your fast
SMR, the polling logic needs to scan the hazard pointers in a way that gives
them acquire semantics. Been there, done that. ;^)


I'm guessing from the docs that the refcounting is a bit like atomic_ptr, with
its distinction between local non-shared and global heap-shared reference
counts.

That's what I am leaning toward. I have not examined the refcount logic in the
source code I downloaded enough to give a definitive answer. The first thing
I will check for is whether or not it uses interlocked instructions. If not,
then it has to be a form of distributed counting. If so, I will compare and
contrast against the method I created for vZOOM ref-counting:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1216861ac568b2be



With refcounting, it was always safe to use a raw reference as long as you
had (owned) one refcount. There seems to be a wrapper class that is the
logical equivalent.

Indeed. It's also safe to take and use a non-NULL raw "read" reference, with
data-dependent loads (dd-load), from a data structure being protected by a
proxy collector:


// acquire a proxy reference to enter a read-only epoch
pc_read& read = proxy_read(&pc_anchor);

// inside the read-only epoch: raw pointers with data-dependent
// loads (dd-load), no reference count mutations per access
//_______________________________________
node* n = load_depends(&shared_lifo);
while (n) {
    node* const nx = load_depends(&n->nx);
    read_only_process(n);
    n = nx;
}

// leave the epoch; deferred reclamation can now run
proxy_release(read);



The burden of proof is on him. He needs to show how his stuff is different
from prior art. That means explicitly listing the prior art and contrasting
it with his stuff.

Yup. He also has to do it with fewer claims than before; the USPTO issued new
rules that affect the number of claims you can have. Here they are:


___________________________________________________________________

(A) - No more than twenty-five total claims.

(B) - No more than five independent claims.

(C) - No more than two continuations per patent.

(D) - Identify all co-pending patent applications.

___________________________________________________________________


Luckily, my vZOOM stuff passes all the above. Although, I was wondering how
RCU is going to explain the numerous continuations? I think that is going to
be a problem...
 

Mingnan G.

HnxGC can perform lock-free concurrent garbage collection even without
GlobalMemoryFence. GlobalMemoryFence is designed to remove the extra
memory ordering semantics of HnxGC, such as acquire/release semantics
for IA64. We don't need to detect quiescent states as RCU does to
perform lock-free access.

Although GMF is very interesting, it is only useful on some processor
platforms. For example, the x86 architecture has enforced memory semantics
for atomic operations, so GMF is not needed even in an x86 multi-processor
environment. I will describe and discuss GMF later in the documentation at
hnxgc.harnixworld.com.

Anyway, thanks for your interest in HnxGC. I don't have experience with
NUMA, but it would be great if you can make it work on a NUMA system and
get some performance benchmarks.
 

Chris Thomasson

Mingnan G. said:
HnxGC can perform lock-free concurrent garbage collection even without
GlobalMemoryFence. GlobalMemoryFence is designed to remove the extra
memory ordering semantics of HnxGC, such as acquire/release semantics
for IA64.

Sure. But that's more expensive. Can the code, as is, run without
GlobalMemoryFence and still elide acquire/release?


We don't need to detect quiescent states as RCU does to
perform lock-free access.

AFAICT, you're using GMF to lower the overhead of reference counter mutations.
I can't quite see if you're using it to get release semantics. As for acquire,
well, I am talking about avoiding a #StoreLoad | #StoreStore barrier; I guess
that would be a store-acquire, to be more precise...



Although GMF is very interesting, it is only useful on some processor
platforms. For example, the x86 architecture has enforced memory semantics
for atomic operations, so GMF is not needed even in an x86 multi-processor
environment.

Some sort of GMF is required on x86 to get the #StoreLoad barrier in a
lock-free reader pattern. You can use GC for that pattern, and it has similar
concerns indeed. For instance, you have to use a GMF to implement SMR. You
have to keep in mind that the implied memory barrier for stores on x86
(#LoadStore | #StoreStore) is not strong enough to keep some subsequent
operations from rising above it...
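
For illustration, here is the classic SMR publish-and-recheck sequence with
an explicit fence supplying the #StoreLoad (plain C++ atomics, hypothetical
names); the GMF variant instead moves that cost to the thread that scans the
hazard pointers:

#include <atomic>

std::atomic<int*> shared_slot{nullptr};   // pointer published by writers
std::atomic<int*> hazard{nullptr};        // this thread's hazard pointer

int* acquire_hazard() {
    int* p;
    do {
        p = shared_slot.load(std::memory_order_relaxed);
        hazard.store(p, std::memory_order_relaxed);           // publish (store)
        std::atomic_thread_fence(std::memory_order_seq_cst);  // the #StoreLoad
    } while (p != shared_slot.load(std::memory_order_relaxed)); // re-check (load)
    return p;   // safe to use until the hazard pointer is cleared
}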



I will describe and discuss GMF later in the documentation at
hnxgc.harnixworld.com.

Anyway, thanks for your interest in HnxGC.

Sure. Your stuff looks like it will work. I was just trying to compare and
contrast it against prior art. I can't seem to find a difference between GMF
and synchronize_rcu()...



I don't have experience with NUMA, but it would be great if you can make it
work on a NUMA system and get some performance benchmarks.

I bet it can work, 95% sure it could... Here is how I did it for my vZOOM
algorithm:

Distributed reference counting and polling-based epoch detection with high
locality with respect to the distance of the memory from the CPU. You network
the local polling entities with wait-free distributed message passing. That
should scale to thousands upon thousands of processors. Applications are
advised to keep their actions as local as possible.
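
A very rough sketch of the distributed-counting half of that description,
with hypothetical names (this is not the vZOOM code); the wait-free message
passing between polling entities, and the epoch detection that decides when a
zero sum is actually stable, are elided here:

#include <atomic>
#include <mutex>
#include <vector>

struct counter_cell {
    std::atomic<long> value{0};   // written only by the owning thread
};

class distributed_refcount {
    std::mutex reg_mutex_;
    std::vector<counter_cell*> cells_;   // one cell per registered thread
public:
    counter_cell* register_thread() {
        std::lock_guard<std::mutex> lk(reg_mutex_);
        cells_.push_back(new counter_cell);   // cells live for the process lifetime
        return cells_.back();
    }
    // fast path: local, contention-free updates by the owning thread only
    static void acquire(counter_cell* c) {
        c->value.store(c->value.load(std::memory_order_relaxed) + 1,
                       std::memory_order_relaxed);
    }
    static void release(counter_cell* c) {
        c->value.store(c->value.load(std::memory_order_relaxed) - 1,
                       std::memory_order_release);
    }
    // slow path: the polling entity sums every cell
    long poll_total() {
        std::lock_guard<std::mutex> lk(reg_mutex_);
        long sum = 0;
        for (counter_cell* c : cells_)
            sum += c->value.load(std::memory_order_acquire);
        return sum;
    }
};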
 

Nick Keighley

Mingnan G. wrote:
You worry too much. For a non-commercial educational purpose, you can apply
for a *FREE* license for this software. See the following:

[blah]

If your development is so breathtakingly innovative and of general
applicability to C++ programming, why don't you make it available to the
whole programming community as free software under a liberal license?

why should he?
 

Matthias Buelow

Nick said:
why should he?

Because locking it up with a patent effectively removes it from the
common pool of knowledge (at least in the U.S.). It isn't even as if it
hadn't been invented yet -- it essentially cannot be (re-)invented by
someone else (until the patent runs out, of course) and thus is
effectively lost to humanity (or at least, the US software community.)

As I don't live in the US, I personally couldn't care less about the
software patent nonsense but still.
 

Nick Keighley

Matthias Buelow wrote:


Because locking it up with a patent effectively removes it from the
common pool of knowledge (at least in the U.S.). It isn't even as if it
hadn't been invented yet -- it essentially cannot be (re-)invented by
someone else (until the patent runs out, of course) and thus is
effectively lost to humanity (or at least, the US software community.)

As I don't live in the US, I personally couldn't care less about the
software patent nonsense but still.

there's a difference between not patenting it and releasing it as
free software.

I don't like software patents either.
 

Matthias Buelow

Nick said:
there's a difference between not patenting it and releasing it as
free software.

True... but such an algorithm, if unpatented (the desirable case), could
be reimplemented relatively easily, thus rendering the original
implementation commercially worthless to its author. So it's rather
logical to me to release it as free software in the first place.
 

Chris Thomasson

Chris Thomasson said:
Sure. But that's more expensive. Can the code, as is, run without
GlobalMemoryFence and still elide acquire/release?




AFAICT, you're using GMF to lower the overhead of reference counter mutations.
I can't quite see if you're using it to get release semantics. As for acquire,
well, I am talking about avoiding a #StoreLoad | #StoreStore barrier; I guess
that would be a store-acquire, to be more precise...





Some sort of GMF is required on x86 to get the #StoreLoad barrier in a
lock-free reader pattern. You can use GC for that pattern, and it has similar
concerns indeed. For instance, you have to use a GMF to implement SMR.

[...]

I need to make a correction to the last sentence. You can certainly use GMF
to implement SMR, but it's not required. You can also use explicit memory
barrier calls...
 

Chris Thomasson

Joe Seigh said:
(sorry for the repost but it probably would help if I actually included
c.p.t)
[...]

Some observations on your pointer manipulation functions...

_________________________________________________________________
- I notice that the _hnxgc_assign_lptr function is made up of more than an
interlocked update, or a simple distributed reference count increment. I
notice multiple lines of code that make calls into other object functions,
themselves made up of multiple lines of code.


- I notice that the _hnxgc_assign_mptr function has more lines than
_hnxgc_assign_lptr, and it does make calls to functions made up of multiple
lines of code.


- I am thinking that a CLockedPtr is an analog of a shared pointer, and
CWeakPtr is an analog of a local pointer. Am I right?


- I don't see support for lock-free operations on the CLockedPtr objects.
Refer to this post:

http://groups.google.com/group/comp.lang.c++/msg/eada4a93d932430e



_________________________________________________________________



Hmm, well, I am thinking that the DWCAS version of Joe's atomic_ptr would be
good to test against. It's also good to keep in mind that circular references
can be designed around. I can give you some examples if you want...
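
For example, one common way to design around a circular reference is to make
the back-edge non-owning, shown here with standard C++ smart pointers and
hypothetical Parent/Child types:

#include <memory>

struct Child;

struct Parent {
    std::shared_ptr<Child> child;   // owning edge
};

struct Child {
    std::weak_ptr<Parent> parent;   // non-owning back-edge breaks the cycle
};

int main() {
    auto p = std::make_shared<Parent>();
    p->child = std::make_shared<Child>();
    p->child->parent = p;           // no ownership cycle is created
}   // both Parent and Child are reclaimed here by refcounting alone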


Have you benchmarked this against a proxy garbage collector yet? This would
be in the context of a reader/writer solution, of course... I am thinking that
it will outperform your GC simply because the pointer access within the
collected region can be completely naked on most of the existing
architectures out there. Another advantage of this type of garbage
collection is that you can clearly separate the writers from the readers.
Also, refer to the last few paragraphs of the following message:

http://groups.google.com/group/comp.lang.c++/browse_frm/thread/95007ffdf246d50e


Also, the implementation of a proxy collector can be very lightweight...
Check this out:

http://groups.google.com/group/comp.programming.threads/msg/f83460262d48e916


I used that code as a benchmark against my vZOOM stuff. The code is
basically in a pseudo-code state because some error/sanity checks were
omitted for brevity...



Any thoughts on this?
 

Chris Thomasson

Chris Thomasson said:
Joe Seigh said:
(sorry for the repost but it probably would help if I actually included
c.p.t)
[...]

Some observations on your pointer manipulation functions...

_________________________________________________________________
- I notice that the _hnxgc_assign_lptr function is made up of more than an
interlocked update, or a simple distributed reference count increment. I
notice multiple lines of code that make calls into other object functions,
themselves made up of multiple lines of code.
[...]

I also notice that you use a barrier in one of those pointer assignment
functions. Also, what does the macro HNX_CONFIG_FASTCALL do in relation to
_hnxgc_assign_lptr?
 

Chris Thomasson

[...]
Have you benchmarked this against a proxy garbage collector yet? This would
be in the context of a reader/writer solution, of course... I am thinking
that it will outperform your GC simply because the pointer access within the
collected region can be completely naked on most of the existing
architectures out there. Another advantage of this type of garbage collection
is that you can clearly separate the writers from the readers. Also, refer to
the last few paragraphs of the following message:

http://groups.google.com/group/comp.lang.c++/browse_frm/thread/95007ffdf246d50e
[...]

Wrong link! Here is the right one:

http://groups.google.com/group/comp.lang.c++/msg/b3e1d10e219908f1
 
