Harry Ohlsen
Some background ...
I have an application where there are many identical strings (the data consists of huge chunks of XML, with a lot of duplication in both the tag names and the CDATA content).
I've written a tiny XML parser in C, because trying to load these documents using REXML ran all night and was still running the next day, presumably due to the size (hundreds of thousands of tags).
Anyway, to reduce the memory used, given the repetitive nature of a lot of the data, I decided to store the strings in a (C-coded) hash table of VALUE objects, so each distinct string is only held once.
Changes to the data are very, very rare, so when a change does happen, I just create a new Ruby string; the values already in the hash table never change.
Now, to my questions ...
I found that when I played with particularly large documents, my code fell over with what looked like some kind of memory corruption. I eventually twigged to the fact that Ruby might be garbage collecting some of the strings I'd constructed, because my C code wasn't doing any rb_gc_mark() calls. That definitely seemed to be the story, because when I wrote one that just went through the entire hashtable and marked each value, the corruption disappeared.
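For reference, the mark hook I wrote is roughly like this (a sketch only, assuming the table is wrapped with Data_Wrap_Struct so Ruby calls the hook on every GC pass; the struct layout here is simplified to a flat array, not my real chained table):

```c
#include "ruby.h"

struct str_table {
    long nentries;
    VALUE *entries;   /* the interned Ruby strings held from C */
};

/* GC mark hook: walk every VALUE in the table so Ruby's collector
 * knows they're still live and never frees them out from under us. */
static void str_table_mark(void *p) {
    struct str_table *t = p;
    long i;
    for (i = 0; i < t->nentries; i++)
        rb_gc_mark(t->entries[i]);
}

/* Wrap the table; Ruby invokes str_table_mark during each mark phase. */
static VALUE wrap_table(struct str_table *t) {
    return Data_Wrap_Struct(rb_cObject, str_table_mark, free, t);
}
```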
So, I guess my questions are: (1) is this likely to be what was really going wrong, or did adding the rb_gc_mark() calls fix the problem by pure luck, and it's waiting to bite me again further down the track; (2) is there some way I can mark all of those objects as always being in use, so that they'll never be considered for garbage collection; and, more importantly, (3) is there a better way to achieve this?
Thanks in advance,
Harry O.