Keith said:
Disclaimer: I haven't read the web page, but (I think) I have a
general idea of how the GC works.
I don't *think* what you describe is going to be much of a problem in
practice, though it's certainly a theoretical problem. In practice, I
suspect that most valid addresses tend to have different bit patterns
than most other valid data.
Yep, many systems start handing out addresses that have a few of the
high address bytes as zeroes, so addresses tend to be discernible if
you don't push the address range very far. And some addresses tend to
be 2/4/8/16 byte aligned when first handed out, so that thins the valid
address range somewhat, at least until the program starts indexing into
arrays.
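
Here is a rough sketch in C of the kind of test a conservative collector might apply to each word it scans, using the two hints above (address range and alignment). The names `heap_range` and `looks_like_block_pointer` are made up for illustration and are not taken from any real GC:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the heap; a real collector would
   maintain something like this itself. */
typedef struct {
    uintptr_t lo;     /* lowest address handed out so far */
    uintptr_t hi;     /* one past the highest address handed out */
    uintptr_t align;  /* alignment of fresh blocks, a power of two */
} heap_range;

/* Return true if `word` could plausibly point at the start of a block. */
static bool looks_like_block_pointer(const heap_range *h, uintptr_t word)
{
    if (word < h->lo || word >= h->hi)
        return false;   /* outside the heap, so treat it as plain data */
    if ((word & (h->align - 1)) != 0)
        return false;   /* misaligned, so not a fresh block address */
    return true;
}
```

The alignment check stops helping once the program keeps pointers into the middle of arrays, so a real collector would likely have to relax it.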
But as soon as you ask for more than 16 bits' worth of memory (64 KB),
an address on a 32-bit architecture will have a zero high byte followed
or preceded by three possibly non-zero bytes (depending on byte order),
and that pattern is mimicked by short zero-terminated strings.
Worse yet, as soon as you ask for more than 24 bits' worth of memory
(16 MB), the high byte on a 32-bit architecture can be non-zero, making
many string bodies mimic addresses. Very bad.
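
To make the string problem concrete, here is a small C sketch. It assumes ASCII and shows the 32-bit word a scanner on a little-endian machine would see when it looks at four bytes of string data. The helper name `le_word` is invented for this example:

```c
#include <stdint.h>

/* Assemble four bytes into the 32-bit value a little-endian machine
   would load from them. Written with shifts so the result is the same
   regardless of the host's own byte order. */
static uint32_t le_word(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

The four bytes of "abc" (including the terminating zero) come out as 0x00636261, which looks like an address a little over 6 MB into the address space. The first four bytes of "abcd" come out as 0x64636261, which looks like an address once the heap has grown well past the 16 MB mark.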
I guess the moral is, when addresses start looking like data, switch to
64-bit compilers!
But I don't think it's likely that a program would pass a
pointer value to a system API *and forget it*.
Well, yes, usually true, but there are two or three thorny canonical problems:
(1) The app may keep the pointer, so the GC won't toss out the block,
but there are many OSes where you can pass in arbitrary arrays or
structs or even linked lists. For example a database app might pass an
array to the OS meaning "gather up these 3,200 pairs of random disk
blocks and put them in this other block of pointers to addresses in my
memory space". The GC would have (usually) no way to intercept that OS
call, and no intrinsic knowledge that the block passed has addresses in
it. Worse yet, the app need NOT keep a pointer to this request
structure, as in most cases the OS will asynchronously call back to the
user program, passing back the request block address, where the OS is
returning result codes. This is very common in Windows NT/XP. So
yipes, the app for a while may not have any trace of these addresses.
(2) There are OS calls to request the OS to allocate memory and return
the virtual address where it has given the app memory. The GC may not
have any way to hook this call and learn about those addresses.
So yes, this clever GC may be able to root around and figure out where
most blocks are, as long as addresses don't get too large, and apps
don't make any fancy OS calls. Whether this is a tenable situation
probably varies a lot from case to case.
I'd love to have a reliable GC for C. Last week I had what I thought
was a pretty clean C program, but when I used my malloc_watcher, at the
end it said "84,132 blocks using 68,321,144 bytes left dangling at
exit(0) time". I had forgotten to free() some large linked lists.
Sigh.
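
For anyone who wants the same kind of exit-time report, here is a minimal sketch of a leak counter in C. This is not the malloc_watcher mentioned above, just one assumed way to get similar numbers. The names `watched_malloc`, `watched_free` and `watch_report` are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static size_t live_blocks;  /* blocks allocated and not yet freed */
static size_t live_bytes;   /* user bytes in those blocks */

/* Header stored in front of each block; the extra union members are
   only there to keep the user data suitably aligned. */
typedef union {
    size_t size;
    long double ld;
    long long ll;
    void *p;
} watch_header;

void *watched_malloc(size_t n)
{
    watch_header *h;
    if (n > SIZE_MAX - sizeof *h)
        return NULL;                 /* size would overflow */
    h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->size = n;                     /* remember the size for free time */
    live_blocks++;
    live_bytes += n;
    return h + 1;                    /* user data starts after the header */
}

void watched_free(void *p)
{
    watch_header *h;
    if (p == NULL)
        return;
    h = (watch_header *)p - 1;       /* step back to the header */
    live_blocks--;
    live_bytes -= h->size;
    free(h);
}

/* Print a summary if anything is still allocated. */
void watch_report(void)
{
    if (live_blocks != 0)
        fprintf(stderr, "%zu blocks using %zu bytes left dangling at exit\n",
                live_blocks, live_bytes);
}
```

Calling atexit(watch_report) early in main() would print the summary at exit. It only counts blocks that go through these two wrappers, so anything allocated with plain malloc() or handed out directly by the OS stays invisible to it.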