Garbage collection problems

C

Chris Dollin

jacob said:
I have never seen a program that XOR pointers.

It was at one point the standard hack to do doubly-linked lists
with just one link. (Not in C, perhaps. I seem to remember that
I came across it for the PDP-8, except, since the PDP-8 didn't
have a XOR [1], one used TAD instead.)

[1] For almost all X you might have expected from a processor,
the PDP-8 didn't have X. Like a STORE instruction, or index
registers [2].

[2] For some value of "STORE", or index, or registers.
 
R

Richard

jacob navia said:
I have never seen a program that XOR pointers.
Neither one that writes pointers to disk, then reads them back.
Sorry, I am programming in C since relatively few years. I
started around 1984.

Thats silly.

Conceivably a distributed system could store a block of pointers on a
remote node for a while via a remote copy which does not increment local
system reference counts. There are lots of things which can cause it to
fail. And if people become too reliant on it it would be a *nightmare*
to find the cause of a problem since none of the mallocs have frees to
match against.

The bottom line is this (using my typically black and white views of
such things) : Garbage Collection sucks. It promotes lazy programming
and a "get it all for free" mentality and is partly responsible for the
bloatware which is now regularly released .

Few seem to consider memory footprint anymore.
 
T

Tor Rustad

cr88192 said:
[...]
IMO, the main domain of C is system and embedded development. Even if
extending this domain by including communication, security development and
DB engine development, a GC seems neither important, or of much interest.

errm, are you trying to claim that all of us good old desktop-pc developers
are all off using stuff like Java and C# or something?...

No, I say that C1X should focus on the main domain of C, and try to keep
the language small and simple. The C99 variable-length arrays was a step
too far. Adding GC to Standard C, would IMO be a major mistake.
must of us are in a land where we still want things like GC, but don't feel
like selling our souls to some proprietary VM framework to get it.

Nobody is stopping you from using the Boehm GC.
 
T

Tor Rustad

santosh said:
I guess that intentionally obfuscated (for whatever reasons) programs
might XOR their pointers.

Yup, I have seen this trick used in software based copy-protection
schemes, the intent was to make the job harder for crackers &
dis-assemblers.

The old AARD code used by Microsoft in win.com, was far worse than
that... debuggers was defeated by pointing interrupts to invalid code,
code itself was XOR encrypted. Duh.. XOR'ed pointers is nothing,
compared to the Windows AARD code.

See "Undocumented DOS" by Andrew Schulman.
 
P

Paul Hsieh

It forces call stack uniformity (the EH needs to be able to crawl up
the stack and unwind the stack in a uniform way.) So for example,
tail-call recursion might not be able to be turned into gotos if there
is a potential for exceptions to be thrown under a call inside that
tail recursion (actually, I don't know that for sure -- but its one of
those things you would have to check very carefully). I also know
that compilers such as WATCOM C/C++ definitely use very non-uniform
stack frames whenever they can get away with it, for leaf-function
optimization reasons.
If an entire program doesn't use any EH, then you'd expect the overhead to
be minimal.

If the entire program is one file, then sure -- the compiler could
detect this and drop the overly-uniform call-stack frames. However,
more real projects are multi-file, and not very many compilers do link-
time recompiling (though Intel's compiler does that).
It might not be zero, unless there's a way to tell the compiler to
compile compilation units assuming that they'll be used in an EH-free
program: it's possible that code that might execute in an EH
environment might suffer optimisation penalties. On the other hand,
I don't see that they'd be much different from any penalties already
paid for set/longjmp.

The set/longjmp storage is handled by the programmer. With EH, it has
to dynamically be stored somewhere -- presumably in the stack itself
(this seems to be the only solution that would make sense in multi-
threaded environments.)
I seem to recall that earlier C++ compilers had this sort of EH-controlling
switch. (Not having using C++ in anger recently, I don't know about
current C++ compilers, not even specifically g++.)

Watcom and Microsoft definitely retain those switches.
 
J

jacob navia

Paul said:
It forces call stack uniformity (the EH needs to be able to crawl up
the stack and unwind the stack in a uniform way.) So for example,
tail-call recursion might not be able to be turned into gotos if there
is a potential for exceptions to be thrown under a call inside that
tail recursion (actually, I don't know that for sure -- but its one of
those things you would have to check very carefully). I also know
that compilers such as WATCOM C/C++ definitely use very non-uniform
stack frames whenever they can get away with it, for leaf-function
optimization reasons.

This is only true if your language needs destructors.
C doesn't need that, so all that does not apply.
 
P

Paul Hsieh

Jacob, don't quote figures you fabricated then use anecdotal evidence to
back them up. That's just not science.

Lol! Can you point me to any other "science" being practiced in this
newsgroup? Come on now, aside from some pointer arithmetic (which
Boehm has not problem with)
[...] Unless you have done proper
research across a wide cross-section of application domains and coding
groups, the fact that you've never seen one means precisely zilch. (I
have seen precisely zero uses of ptrdiff_t - I guess it doesn't exist,
right?)

The Better String Library uses ptrdiff_t to deal with aliasing in a
conservative and portable way.
It doesn't matter how many people can use GC successfully, it matters
how many /can't/

Jacob's response to you here is very on point. GC is being proposed
to deal with the much worse problem of dealing with malloc and free.
The problems that might happen due to misuse of GC will be a drop in
the ocean compared with the rampant memory leaks C programmers create
today.
[...] - and because the sort of problems that we're talking
about, the errors are of the intermittent runtime difficult-to-debug
variety. If I can't be absolutely sure that I can use your GC on my
standard C program, I won't bother. I'd have to go through the entire
program to make sure it doesn't do 'clever' stuff with pointers; or
start a new C-with-GC program from scratch (but I wouldn't do that
because I'd be stuck with one compiler on one platform).

You are worried about non-opaque (and non-pointer offsetting) pointer
manipulation? No matter what it is you have in mind, its almost
certainly a violation of the standard. Writing pointers to a file
then reading them back? That's extremely brittle coding that I would
be very scared of -- disks can fail, and be accessed asynchronously by
other processes.

Real programs don't do this sort of nonsense. We're talking about a
percentage that is way less than the coverage missed by a typical test
suite for even the best software.

This is why the Boehm collector exists, and is being used in real
production code today.
 
F

Francine.Neary

How many applications you have seen that write pointers into
disk files, xor pointers or do such kind of nonsense?

I have never seen one.

Surely the point is that if you want to propose a *standardized* GC
(which would be better done in comp.std.c), then what does and doesn't
break it will need to be specified!

You could either do this negatively:
"As long as you don't xor pointers or write them to disk or ...
(expand 'such kind of nonsense' here) or ... then your program can use
GC safely"

or positively:
"You're allowed to store arbitrarily many pointers in arbitrarily deep
linked sequences, add offsets to them, cast them(?), ..., and GC is
guaranteed to work"

Either way, it seems to me that you'll end up with an extremely long
and unwieldy list of conditions if you want to cover all bases. If you
can't express an idea cleanly, that might indicate that there's
something wrong with it...
 
C

cr88192

Charlton Wilbur said:
cr> after all, if no one really wanted GC, then Java and C# would
cr> leave them out, and things like Boehm, and the endless other
cr> custom GC frameworks, would simply not be used.

Of course. *Someone* wants garbage collection. At times, even *I*
want garbage collection. But you cannot go from that statement, which
you have provided adequate evidence for, to your original claim that
*most* programmers want garbage collection without a whole lot more
evidence than you have thus far provided.

I see enough benefits to having a memory management scheme that's
completely deterministic and completely in the hands of the programmer
that I don't want to see it go away as an option. And (as has been
pointed out) the semantics of C make it very difficult to use a
garbage collector that's implemented as a library; I think the
effort's better spent elsewhere.

who ever said I wanted malloc to go away?...

I say, we have the GC, and we have malloc.

maybe they can share the same implementation, or maybe they dont.
in any case, I am not arguing that we abandon malloc and put everything in
the hands of the GC, rather, I argue, we have both options available, and
use each where it makes sense.

so, I argue for availability, and that it is a worthwhile option, not
mandatory usage, or even abandoning manual memory management approaches
(when it so happens that they make the most sense...).

for example, my new gc has a function, said, 'gcfree()', and what does it
do? it frees things. why? because freeing something ealier makes things a
lot faster than waiting for the heap to get full and the GC to reclaim it...


as for said "difficulty". I have used garbage collection in my projects for
many years, and have never had much difficulty with it (apart from dealing
with certain cases, such as a previous version of my GC, when used as the
core of a previous script lang, being too damn slow and having to collect
the heap too damn often...).
 
C

cr88192

Paul Hsieh said:
Lol! Can you point me to any other "science" being practiced in this
newsgroup? Come on now, aside from some pointer arithmetic (which
Boehm has not problem with)

it is the science of edge-cases and arguing...

[...] Unless you have done proper
research across a wide cross-section of application domains and coding
groups, the fact that you've never seen one means precisely zilch. (I
have seen precisely zero uses of ptrdiff_t - I guess it doesn't exist,
right?)

The Better String Library uses ptrdiff_t to deal with aliasing in a
conservative and portable way.
It doesn't matter how many people can use GC successfully, it matters
how many /can't/

Jacob's response to you here is very on point. GC is being proposed
to deal with the much worse problem of dealing with malloc and free.
The problems that might happen due to misuse of GC will be a drop in
the ocean compared with the rampant memory leaks C programmers create
today.

yes.
I use to do this some before I discovered GC.

if I could not determine well enough if/when things needed to be freed, I
didn't bother.
I also tended to end up with apps that would randomly "explode" as well...


eventually, I started implementing custom memory managers, and later,
garbage collectors, as a way of battling this problem...

the general structure of these allocators tends to remain the same though:
I allocate memory in large "chunks" (say, 256kB or 1MB). these chunks were
then divided into a large number of objects, which were allocated and freed.

some of my early allocators used linked lists and best-fit, but tended to be
kind of slow (especially when I had mark/sweep GCs, as a long time during
sweep was used up positionally inserting every freed object into the free
list...).

later on (starting in about 2002 and since), pretty much all my allocators
have been cell-based (each block is a large array of fixed-size cells
managed by bitmaps).

[...] - and because the sort of problems that we're talking
about, the errors are of the intermittent runtime difficult-to-debug
variety. If I can't be absolutely sure that I can use your GC on my
standard C program, I won't bother. I'd have to go through the entire
program to make sure it doesn't do 'clever' stuff with pointers; or
start a new C-with-GC program from scratch (but I wouldn't do that
because I'd be stuck with one compiler on one platform).

You are worried about non-opaque (and non-pointer offsetting) pointer
manipulation? No matter what it is you have in mind, its almost
certainly a violation of the standard. Writing pointers to a file
then reading them back? That's extremely brittle coding that I would
be very scared of -- disks can fail, and be accessed asynchronously by
other processes.

Real programs don't do this sort of nonsense. We're talking about a
percentage that is way less than the coverage missed by a typical test
suite for even the best software.

it is brittle, but I have heard of a few apps doing this kind of thing.
an example of this is some forms of persistent store, where part (or all) of
the heap is stored to a file.

of course, these kind of apps usually do a custom heap (a custom memory
manager over some large mmaped block of memory), always mmap the file in the
same place, and disallow any pointers to memory outside this region.


of course, this has little to do with whether or not the implementation has
GC, since such memory is completely outside the control of the GC anyways...

of course, this is only one of several ways to implement a persistent
store...

This is why the Boehm collector exists, and is being used in real
production code today.

yes, it is used, because it is usable...

 
C

cr88192

Surely the point is that if you want to propose a *standardized* GC
(which would be better done in comp.std.c), then what does and doesn't
break it will need to be specified!

You could either do this negatively:
"As long as you don't xor pointers or write them to disk or ...
(expand 'such kind of nonsense' here) or ... then your program can use
GC safely"

or positively:
"You're allowed to store arbitrarily many pointers in arbitrarily deep
linked sequences, add offsets to them, cast them(?), ..., and GC is
guaranteed to work"

Either way, it seems to me that you'll end up with an extremely long
and unwieldy list of conditions if you want to cover all bases. If you
can't express an idea cleanly, that might indicate that there's
something wrong with it...

I don't think the list has to be "that" long or unweildy (after all, most of
the C standard would have to be filled with such lists in that case).

for example, here are a few general requirements for my GC:
1. pointers to GC'ed objects need to be located in GC-reachable memory to be
GC-reachable;
2. pointers need to be stored in aligned memory locations (this much is done
by the compiler);
3. a pointer is required to point somewhere inside the object being pointed
to;
4. an GC'ed object is not required to be retained if not GC-reachable at any
point in time.

and, provisions:
data, bss, and the main and gc-known thread stacks are reachable;
the search depth is only bounded by the ability to create a sufficiently
large trace stack;
....

so, no xor'ing, as this violates 3.
no files, strings, or malloc'ed memory, as this violates 1 and 4;
care needs be taken when mixing GC and byte-arrays, due to 2.

for example:
char a[4];
*((void **)a)=gcalloc(256);
is in potential violation of 2.

....

 
C

cr88192

Tor Rustad said:
cr88192 said:
[...]
IMO, the main domain of C is system and embedded development. Even if
extending this domain by including communication, security development
and DB engine development, a GC seems neither important, or of much
interest.

errm, are you trying to claim that all of us good old desktop-pc
developers are all off using stuff like Java and C# or something?...

No, I say that C1X should focus on the main domain of C, and try to keep
the language small and simple. The C99 variable-length arrays was a step
too far. Adding GC to Standard C, would IMO be a major mistake.

I will agree to a point here:
those kind of arrays, IMO, do not belong within the main language.

but, GC is a runtime feature, and it is very sensible that it be left out
for embedded targets.


I was not arguing for standardized GC though, only that GC, itself, has
value.
where and for who it has value, are the people who use it.

to what extent standardization would make sense, would be in terms of a
"standardized" definition of the API and provided functionality and
semantics (said, "portable" GC), but I do not expect by any means that it be
somehow "required" (as is the case with Java and C#, where GC is an assumed
and fundamental part of the language).

Nobody is stopping you from using the Boehm GC.

what do you think I was arguing through most of the thread?...

this is what I was arguing, that using Boehm is a very valid and sensible
option (however, certain other people were opposing that GC has use in C
land, ever...).

 
C

CBFalconer

jacob said:
.... snip ...

This change doesn't look good for being able to use third party
libraries compiled with a different compiler...

You can't count on that anyhow. In fact, you expect failure.
 
C

CBFalconer

Paul said:
.... snip ...

It forces call stack uniformity (the EH needs to be able to crawl up
the stack and unwind the stack in a uniform way.) So for example,

To bring things back to reality, there is no reason to have a stack
in any C system.
 
C

cr88192

CBFalconer said:
You can't count on that anyhow. In fact, you expect failure.

theoretically, yes.

in practicality, one is forced to deal with this issue.

for example, between windows and linux, there are any number of compilers
you can compile code with, and they can make dlls and static libraries.

if one makes their compiler incompatible, say, by using a different calling
convention, then they are unable to use really any code (static libraries,
DLLs, ...) for their platform, and the compiler becomes essentially useless.


so, on x86, there is a singular more or less mandatory calling convention
(cdecl), and on windows, another less-common but still mandatory convention
(stdcall).

on x86-64, there are 2 calling conventions, the SystemV x86-64 convention
(followed by linux and friends), and the Win64 calling convention (windows).

and, again, to be useful on these archs, one needs to support these
conventions (even as such, I have been vaguely tempted to just use my own
convention for x86-64, but this would require, of all horrors, stubbing, to
interface external code).

what convention would I use:
I would actually just "upgrade" the existing 32-bit x86 calling convention
to work on x86-64...
that or use the Win64 convention internally, which should not be too
difficult to support (would just need to come up with some means of creating
a Win64 to SysV adaptor interface...).

or such...
 
C

cr88192

CBFalconer said:
To bring things back to reality, there is no reason to have a stack
in any C system.

errm?...

what kind of 'reality' is it you are living in?...


I guess it is theoretically possible, but it would lead to incompatibility
and what reason is there that anyone should implement such a thing?...

I guess, maybe you mean, all the call-frames are allocated on the heap, but
then one has to totally re-engineer the calling conventions, recompile all
the libraries, and likely deal with a pretty damn major performance hit.


I guess it could allow things like scheme style continuations, lazy
evaluation, and simplify using fine-grained concurrency (no more "threads",
only 'workers' and 'work queues').


the fact that pretty much every arch has a stack, and those that don't, fake
it, this is telling of its relevance...
 
J

jacob navia

cr88192 said:
errm?...

what kind of 'reality' is it you are living in?...

Mr Falconer has his own world, as many people here around.

A stack is not necessary, one complement or sign magnitude
representations rule the world, etc etc.

I think we have touched a trap representation
:)
 
C

cr88192

jacob navia said:
Mr Falconer has his own world, as many people here around.

A stack is not necessary, one complement or sign magnitude
representations rule the world, etc etc.

I think we have touched a trap representation
:)

yes, ok.


but, it is maybe interesting from a theoretical perspective, aka:
a C-implementation allocating all the call frames on a heap.

so, we want a call frame?
grab it from a free-list (or, somehow, allocate more if needed).

need to store locals? grab a chunk of memory for this and attach to call
frame.
need space for args? likewise, grab a chunk of memory, evaluate and store in
the args, and attach it to the (newly created) call frame of the function
being called.

....

afaik, this particular approach is used in many scheme implementations, for
the reason that it makes implementing call/cc really fast and easy (we can
jump outword to some enclosing call frame, do stuff, or jump right back into
a call frame we had previously jumped out of).

some people have used call/cc as a way of implementing multithreading from
within scheme itself, ...


however, such an idea, though possible, would result in something very
different than a traditional C compiler.

very likely, we would have to rework the C code into CPS (actually, this is
partly how the JIT backend for my last script VM worked, as it compiled the
single bytecode blocks into some number of CPS-linked chunks).

CPS:
http://en.wikipedia.org/wiki/Continuation_passing_style


maybe external functions, like API calls, could be handled by temporarily
assigning a conventional stack (this stack serving to disappear once the
call completes). an inward call would mostly call a stub which serves to
transfer from a more traditional stack-based call-frame to the heap-and-CPS
variant...


of course, in my case, something like this would require an almost
completely different implementation of my lower compiler...

of course, it would make a whole lot of sense if I ever really intended to
(effectively) support languages like scheme or haskell...


interesting? maybe.
practical? probably not.
 
J

jacob navia

cr88192 said:
yes, ok.


but, it is maybe interesting from a theoretical perspective, aka:
a C-implementation allocating all the call frames on a heap.

so, we want a call frame?
grab it from a free-list (or, somehow, allocate more if needed).

need to store locals? grab a chunk of memory for this and attach to call
frame.
need space for args? likewise, grab a chunk of memory, evaluate and store in
the args, and attach it to the (newly created) call frame of the function
being called.

So, you are implementig the stack in software. Instead of doing a
single register subtract, you allocate memory, put that in the list
of active frames, etc etc!

That would be really bad for performance
...

afaik, this particular approach is used in many scheme implementations, for
the reason that it makes implementing call/cc really fast and easy (we can
jump outword to some enclosing call frame, do stuff, or jump right back into
a call frame we had previously jumped out of).

call/cc is completely UNREADABlE. You do not see in the program text
where the program is going, you have to figure out what is the current
continuation!!!!!

That was something completely NUTS!

some people have used call/cc as a way of implementing multithreading from
within scheme itself, ...

Fine... When it works. When it doesn't, I would not like to be
the one debugging that stuff.
interesting? maybe.
practical? probably not.

I agree with that!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top