Is there any way to mark an object as "always in use" (specifically, in a C extension)?

Harry Ohlsen

Some background ...

I have an application where there are many identical strings (the data consists of huge chunks of XML, with a lot of duplication in both the tag names and the CDATA content).

I've written a tiny XML parser in C, because trying to load these documents using REXML ran all night and was still running the next day, presumably due to the size (hundreds of thousands of tags).

Anyway, to reduce the memory used, given the repetitive nature of a lot of the data, I decided to store the strings as a (C coded) hash table of VALUE objects.

Changes to the data are very, very few, so when this happens, I just create a new Ruby string, so the values in the hash table never change.

Now, to my questions ...

I found that when I played with particularly large documents, my code fell over with what looked like some kind of memory corruption. I eventually twigged to the fact that Ruby might be garbage collecting some of the strings I'd constructed, because my C code wasn't doing any rb_gc_mark() calls. That definitely seemed to be the story, because when I wrote one that just went through the entire hashtable and marked each value, the corruption disappeared.

So, I guess my questions are: (1) is this likely to be what was really going wrong, or did adding the rb_gc_mark() calls fix the problem by pure luck and it's waiting to bite me again, further down the track; (2) is there some way I can mark all of those objects as always being in use, so that they'll never be considered for garbage collection; and more importantly (3) is there a better way to achieve this?

Thanks in advance,

Harry O.
 
nobu.nokada

Hi,

At Fri, 6 Feb 2004 10:29:43 +0900,
Harry Ohlsen wrote in [ruby-talk:91665]:
So, I guess my questions are:
(1) is this likely to be what was really going wrong, or did
adding the rb_gc_mark() calls fix the problem by pure luck
and it's waiting to bite me again, further down the track;

Seems correct.
(2) is there some way I can mark all of those objects as
always being in use, so that they'll never be considered for
garbage collection;

You may want to use rb_gc_register_address()?
and more importantly (3) is there a better way to achieve
this?

It's a single big hash per process, not one per instance,
right?

Two ways:
1. rb_gc_register_address();

2. make the hash a hidden instance variable of some existing
class.
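In C-extension terms, the first suggestion amounts to something like the following sketch. The extension name and variable name are invented here; rb_gc_register_address() is the real API (this fragment belongs in an extension's Init function and needs ruby.h to build).

```c
#include <ruby.h>

/* One table for the whole process, created at extension load time. */
static VALUE string_table;

void Init_myparser(void)          /* "myparser" is a made-up name */
{
    string_table = rb_hash_new();
    /* Tell the GC to treat &string_table as a root: the Hash and
     * everything reachable from it will never be collected, so no
     * per-collection marking of the entries is needed. */
    rb_gc_register_address(&string_table);
}
```
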
 
Harry Ohlsen

You may want to use rb_gc_register_address()?

Thanks. I'll look that up!
It's a single big hash per process, not one per instance,
right?

There's a single hash table, that's not registered as a Ruby object. However, the data stored in the table are all VALUE objects obtained from rb_str_new2().

There is a single instance of an object implemented in C that Ruby *does* know about, and that object holds many references to the strings in the hash table. For one document, for example, there were 2.7 million references to around 87,000 strings, totalling just under 267,000 bytes of text. It's the ..._mark() function of that class where I currently mark all of the strings.

While it doesn't *seem* to be taking a huge amount of time to do that each time, I'd like to try to avoid it, just to see whether that really is the case ... ie, see whether the actual time being used is significant. In any case, if it's easy to fix, it just doesn't make sense to keep marking them over and over.
2. make the hash a hidden instance variable of some existing
class.

That might work, but as things are currently, Ruby doesn't know anything about the hash table. It's simply an implementation detail of the extension. However, if I can't work out how to get it working with rb_gc_register_address(), I'll see if I can do something along these lines.

Thanks for the suggestions!

Harry O.
 
nobu.nokada

Hi,

At Fri, 6 Feb 2004 11:49:59 +0900,
Harry Ohlsen wrote in [ruby-talk:91669]:
There's a single hash table, that's not registered as a Ruby
object. However, the data stored in the table are all VALUE
objects obtained from rb_str_new2().

You mean struct st_table? If so, you may register a Hash
instance to GC and use its tbl member directly.
There is a single instance of an object implemented in C that
Ruby *does* know about, and that object holds many references
to the strings in the hash table. For one document, for
example, there were 2.7 million references to around 87,000
strings, totalling just under 267,000 bytes of text. It's
the ..._mark() function of that class where I currently mark
all of the strings.

Ruby's current GC is slow when there are large numbers of live objects.
While it doesn't *seem* to be taking a huge amount of time to
do that each time, I'd like to try to avoid it, just to see
whether that really is the case ... ie, see whether the
actual time being used is significant. In any case, if it's
easy to fix, it just doesn't make sense to keep marking them
over and over.

Generational GC may help you, but it hasn't been incorporated yet.
 
Harry Ohlsen

You mean struct st_table?

Sorry, I should have been more specific. The hash table is just one I implemented myself in C. It's only the data held in it that are Ruby objects (Strings). It's an interesting point, though. Maybe I could save myself some code by changing the C to use a Ruby hash.
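The deduplication idea itself can be sketched in pure Ruby (class and method names here are illustrative, not from the actual extension):

```ruby
# Minimal string-interning pool: each distinct string is stored once,
# frozen, and every caller gets back the same shared object.
class StringPool
  def initialize
    @pool = {}
  end

  # Return the shared copy of str, creating it on first sight.
  def intern(str)
    @pool[str] ||= str.dup.freeze
  end

  def size
    @pool.size
  end
end

pool = StringPool.new
a = pool.intern("chapter")
b = pool.intern("chapter")
# a and b are the very same frozen object, so 87,000 distinct strings
# cost 87,000 String objects however many million references exist.
```
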

I've only learned enough about C extensions to get done what I needed. I plan to do some serious study when I get a chance. I must say, it was pretty easy to get started ... as I would expect from anything related to Ruby, of course!
If so, you may register a Hash
instance to GC and use its tbl member directly.

That definitely sounds simpler.
Ruby's current GC is slow when there are large numbers of live objects.

I have a feeling this is why REXML had a problem loading the document, because it probably needs to create quite a few other (sub-)objects for each XML tag, hence it would *really* be working hard!
Generational GC may help you, but it hasn't been incorporated yet.

I've seen mention of GGC a number of times on the list. Is there a plan to add it to Ruby 1.X.Y, or will we have to wait until version 2?

Cheers,

Harry O.
 
Ralf Horstmann

At Fri, 06 Feb 2004 10:29:43 +0900, Harry Ohlsen wrote:
Some background ...

I have an application where there are many identical strings
(the data consists of huge chunks of XML, with a lot of
duplication in both the tag names and the CDATA content).

I've written a tiny XML parser in C, because trying to load these
documents using REXML ran all night and was still running the next day,
presumably due to the size (hundreds of thousands of tags).

Have you already tried xmlparser (wrapper around
expat)? It's quite fast. I use it for huge XML documents where
rexml and nqxml are way too slow.

Ralf.
 
Harry Ohlsen

Ralf said:
Have you already tried xmlparser (wrapper around
expat)?
Back when I originally wrote it, I didn't have control of the box and
hence couldn't get expat installed easily, so I didn't look any further
at the time. However, I might give it a go by installing it in my own
account. I also didn't have a lot of time to get this up and running
back then.

Since I already had some C code that did what I wanted (and nothing
more), I figured it would be faster to wrap it ... plus, in the back of
my mind, I'm sure I was thinking "what a great opportunity to learn how
to do C extensions" :).

Nobu's suggestion worked fine, although I've not benchmarked yet to see
whether the change has made a significant difference ... this thing
takes quite a while to run, so it's hard to tell unless you think to
look at the clock, or print some timestamps out, which is what I'll do
when I get back to work on Monday.
It's quite fast. I use it for huge XML documents where
rexml and nqxml are way too slow.
Just out of interest, how large was your "huge"? Some of my documents
are (literally) hundreds of megabytes.

The other point I should make is that this application has to be able to
make fairly arbitrary changes to the DOM, like moving whole subtrees
around, and the changes are user-defined, hence I can't even use some
kind of smart housekeeping, so event-driven parsing won't work for me.

Cheers,

Harry O.
 
Ralf Horstmann

At Sat, 07 Feb 2004 07:46:33 +0900, Harry Ohlsen wrote:
Just out of interest, how large was your "huge"? Some of my documents
are (literally) hundreds of megabytes.

I just checked and found it to be about 10 megabytes. So actually not that
much data. But it was already enough to let rexml run for hours :)

Regards,
Ralf.
 
Zachary P. Landau

I just checked and found it to be about 10 megabytes. So actually not that
much data. But it was already enough to let rexml run for hours :)

There seems to be a problem/bug/whatever with the current version of
REXML that makes large files take extra long to process. It reads the
entire file in before it starts processing, which kills performance. Try
adding this code to your program:


module REXML
  class IOSource
    alias_method :_initialize, :initialize

    def initialize(arg, block_size=500)
      @er_source = @source = arg
      @to_utf = false
      @line_break = '>'
      super @source.readline(@line_break)
      @line_break = encode( '>' )
    end
  end
end

That seems to fix the problem for other people.

--
Zachary P. Landau <[email protected]>
GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc

 
Harry Ohlsen

Zachary said:
There seems to be a problem/bug/whatever with the current version of
REXML that makes large files take extra long to process.
Is this a new problem introduced in a recent version? If so, it's
probably not the cause of the slowness I was seeing, because I tried it
about five or six months ago.

However, it's definitely worth knowing about that patch for the next
time I want to do some XML processing, because REXML is just so nice to
use that it would normally be my first choice!

Cheers,

Harry O.
 
Zachary P. Landau


Harry Ohlsen wrote:

Is this a new problem introduced in a recent version? If so, it's
probably not the cause of the slowness I was seeing, because I tried it
about five or six months ago.

However, it's definitely worth knowing about that patch for the next
time I want to do some XML processing, because REXML is just so nice to
use that it would normally be my first choice!

The problem came with 1.8.1, so that wouldn't have been the cause.
Yours was probably just a huge file :p


 
