Is there any way to mark an object as "always in use" (specifically, in a C extension)?

Discussion in 'Ruby' started by Harry Ohlsen, Feb 6, 2004.

  1. Harry Ohlsen

    Harry Ohlsen Guest

    Some background ...

    I have an application where there are many identical strings (the data consists of huge chunks of XML, with a lot of duplication in both the tag names and the CDATA content).

    I've written a tiny XML parser in C, because trying to load these documents using REXML ran all night and was still running the next day, presumably due to the size (hundreds of thousands of tags).

    Anyway, to reduce the memory used, given the repetitive nature of a lot of the data, I decided to store the strings as a (C coded) hash table of VALUE objects.

    Changes to the data are very, very few, so when this happens, I just create a new Ruby string, so the values in the hash table never change.
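
The interning scheme described above can be sketched in plain Ruby rather than C. This is only an illustration, not the poster's code: the `StringPool` class and its `intern` method are hypothetical names, and a Ruby Hash stands in for the hand-written C hash table.

```ruby
# Hypothetical illustration: a pool that keeps one shared copy of each
# distinct string, so repeated tag names and text share a single object.
class StringPool
  def initialize
    @pool = {} # maps string content -> the one canonical String object
  end

  # Return the canonical copy of +text+. The first time a given content
  # is seen, a frozen duplicate is stored and used as both key and value.
  def intern(text)
    @pool.fetch(text) do
      canonical = text.dup.freeze
      @pool[canonical] = canonical
    end
  end

  # Number of distinct strings held in the pool.
  def size
    @pool.size
  end
end

pool = StringPool.new
a = pool.intern("item")
b = pool.intern("item")   # same content, so the same object comes back
c = pool.intern("price")
```

Because the pooled strings are frozen, a change has to produce a new string, which matches the "values never change" rule above.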

    Now, to my questions ...

    I found that when I played with particularly large documents, my code fell over with what looked like some kind of memory corruption. I eventually twigged to the fact that Ruby might be garbage collecting some of the strings I'd constructed, because my C code wasn't doing any rb_gc_mark() calls. That definitely seemed to be the story, because when I wrote one that just went through the entire hashtable and marked each value, the corruption disappeared.

    So, I guess my questions are: (1) is this likely to be what was really going wrong, or did adding the rb_gc_mark() calls fix the problem by pure luck and it's waiting to bite me again, further down the track; (2) is there some way I can mark all of those objects as always being in use, so that they'll never be considered for garbage collection; and more importantly (3) is there a better way to achieve this?

    Thanks in advance,

    Harry O.
     
    Harry Ohlsen, Feb 6, 2004
    #1

  2. Nobu Nakada

    Guest

    Re: Is there any way to mark an object as "always in use" (specifically, in a C extension)?

    Hi,

    At Fri, 6 Feb 2004 10:29:43 +0900,
    Harry Ohlsen wrote in [ruby-talk:91665]:
    > So, I guess my questions are:
    > (1) is this likely to be what was really going wrong, or did
    > adding the rb_gc_mark() calls fix the problem by pure luck
    > and it's waiting to bite me again, further down the track;


    Seems correct.

    > (2) is there some way I can mark all of those objects as
    > always being in use, so that they'll never be considered for
    > garbage collection;


    You may want to use rb_gc_register_address()?

    > and more importantly (3) is there a better way to do achieve
    > this?


    Is it the single big hash in process, but not per instance,
    right?

    2 ways:
    1. rb_gc_register_address(),

    2. make the hash a hidden instance variable of any class, if
    exists.
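
Both suggestions rely on the same rule: an object that is reachable from something the collector already treats as live is never reclaimed. In C the hidden instance variable would be set through the extension API; the plain-Ruby sketch below only illustrates the rule itself. `Registry` is a hypothetical name, and an ordinary class-level instance variable stands in for the hidden one.

```ruby
require 'weakref'

# Hypothetical holder: the class object is always reachable, so anything
# stored in its @strings hash stays reachable too.
class Registry
  @strings = {}

  class << self
    attr_reader :strings
  end
end

Registry.strings["tag"] = "tag".dup

# A WeakRef does not keep its target alive by itself, so it lets us check
# whether the string survived a collection purely because Registry holds it.
ref = WeakRef.new(Registry.strings["tag"])
GC.start
puts "still alive after GC: #{ref.weakref_alive?}"
```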

    --
    Nobu Nakada
     
    Nobu Nakada, Feb 6, 2004
    #2

  3. Harry Ohlsen

    Harry Ohlsen Guest

    Nobu Nakada wrote:
    >>(2) is there some way I can mark all of those objects as
    >>always being in use, so that they'll never be considered for
    >>garbage collection;

    >
    >
    > You may want to use rb_gc_register_address()?


    Thanks. I'll look that up!

    > Is it the single big hash in process, but not per instance,
    > right?


    There's a single hash table, that's not registered as a Ruby object. However, the data stored in the table are all VALUE objects obtained from rb_str_new2().

    There is a single instance of an object implemented in C that Ruby *does* know about, and that object holds many references to the strings in the hash table. For one document, for example, there were 2.7 million references to around 87,000 strings, totalling just under 267,000 bytes of text. It's the ..._mark() function of that class where I currently mark all of the strings.

    While it doesn't *seem* to be taking a huge amount of time to do that each time, I'd like to try to avoid it, just to see whether that really is the case ... ie, see whether the actual time being used is significant. In any case, if it's easy to fix, it just doesn't make sense to keep marking them over and over.
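
One rough way to check whether the cost matters is to time a full collection while a large number of strings are live. The sketch below does that from the Ruby side with the standard Benchmark module. The count of 87,000 strings comes from the figures above; the helper name `time_full_gc` is made up for illustration. It times the whole collector rather than the extension's mark function alone, and the result will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Hypothetical helper: wall-clock seconds taken by one full GC run.
def time_full_gc
  Benchmark.realtime { GC.start }
end

# Keep roughly as many distinct strings alive as in the document described.
live = Array.new(87_000) { |i| "string-#{i}" }

elapsed = time_full_gc
puts format("full GC with %d live strings: %.4f s", live.size, elapsed)
```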

    > 2. make the hash a hidden instance variable of any class, if
    > exists.


    That might work, but as things are currently, Ruby doesn't know anything about the hash table. It's simply an implementation detail of the extension. However, if I can't work out how to get it working with rb_gc_register_address(), I'll see if I can do something along these lines.

    Thanks for the suggestions!

    Harry O.
     
    Harry Ohlsen, Feb 6, 2004
    #3
  4. Nobu Nakada

    Guest

    Re: Is there any way to mark an object as "always in use" (specifically, in a C extension)?

    Hi,

    At Fri, 6 Feb 2004 11:49:59 +0900,
    Harry Ohlsen wrote in [ruby-talk:91669]:
    > > Is it the single big hash in process, but not per instance,
    > > right?

    >
    > There's a single hash table, that's not registered as a Ruby
    > object. However, the data stored in the table are all VALUE
    > objects obtained from rb_str_new2().


    You mean struct st_table? If so, you may register a Hash
    instance to GC and use its tbl member directly.

    > There is a single instance of an object implemented in C that
    > Ruby *does* know about, and that object holds many references
    > to the strings in the hash table. For one document, for
    > example, there were 2.7 million references to around 87,000
    > strings, totalling just under 267,000 bytes of text. It's
    > the ..._mark() function of that class where I currently mark
    > all of the strings.


    Current ruby's GC is weak for large amounts of live objects.

    > While it doesn't *seem* to be taking a huge amount of time to
    > do that each time, I'd like to try to avoid it, just to see
    > whether that really is the case ... ie, see whether the
    > actual time being used is significant. In any case, if it's
    > easy to fix, it just doesn't make sense to keep marking them
    > over and over.


    Generational GC may help you, but it isn't incorporated yet.

    --
    Nobu Nakada
     
    Nobu Nakada, Feb 6, 2004
    #4
  5. Harry Ohlsen

    Harry Ohlsen Guest

    Nobu Nakada wrote:
    >>There's a single hash table, that's not registered as a Ruby
    >>object. However, the data stored in the table are all VALUE
    >>objects obtained from rb_str_new2().

    >
    >
    > You mean struct st_table?


    Sorry, I should have been more specific. The hash table is just some C code I wrote to implement one. It's only the data held in it that are Ruby objects (String). It's an interesting point, though. Maybe I could save myself some code by changing the C to use a Ruby hash.

    I've only learned enough about C extensions to get done what I needed. I plan to do some serious study when I get a chance. I must say, it was pretty easy to get started ... as I would expect from anything related to Ruby, of course!

    > If so, you may register a Hash
    > instance to GC and use its tbl member directly.


    That definitely sounds simpler.

    > Current ruby's GC is weak for large amounts of live objects.


    I have a feeling this is why REXML had a problem loading the document, because it probably needs to create quite a few other (sub-)objects for each XML tag, hence it would *really* be working hard!

    > Generational GC may help you, but it isn't incorporated yet.


    I've seen mention of GGC a number of times on the list. Is there a plan to add it to Ruby 1.X.Y, or will we have to wait until version 2?

    Cheers,

    Harry O.
     
    Harry Ohlsen, Feb 6, 2004
    #5
  6. Re: Is there any way to mark an object as "always in use" (specifically, in a C extension)?

    At Fri, 06 Feb 2004 10:29:43 +0900 wrote Harry Ohlsen:

    > Some background ...
    >
    > I have an application where there are many identical strings
    > (the data consists of huge chunks of XML, with a lot of
    > duplication in both the tag names and the CDATA content).
    >
    > I've written a tiny XML parser in C, because trying to load these
    > documents using REXML ran all night and was still running the next day,
    > presumably due to the size (hundreds of thousands of tags).


    Have you already tried xmlparser (wrapper around
    expat)? It's quite fast. I use it for huge XML documents where
    rexml and nqxml are way too slow.

    Ralf.
     
    Ralf Horstmann, Feb 6, 2004
    #6
  7. Harry Ohlsen

    Harry Ohlsen Guest

    Ralf Horstmann wrote:

    >Have you already tried xmlparser (wrapper around
    >expat)?
    >

    Back when I originally wrote it, I didn't have control of the box and
    hence couldn't get expat installed easily, so I didn't look any further
    at the time. However, I might give it a go by installing it in my own
    account. I also didn't have a lot of time to get this up and running
    back then.

    Since I already had some C code that did what I wanted (and nothing
    more), I figured it would be faster to wrap it ... plus, in the back of
    my mind, I'm sure I was thinking "what a great opportunity to learn how
    to do C extensions" :).

    Nobu's suggestion worked fine, although I've not benchmarked yet to see
    whether the change has made a significant difference ... this thing
    takes quite a while to run, so it's hard to tell unless you think to
    look at the clock, or print some timestamps out, which is what I'll do
    when I get back to work on Monday.

    >It's quite fast. I use it for huge XML documents where
    >rexml and nqxml are way too slow.
    >
    >

    Just out of interest, how large was your "huge"? Some of my documents
    are (literally) hundreds of megabytes.

    The other point I should make is that this application has to be able to
    make fairly arbitrary changes to the DOM, like moving whole subtrees
    around, and the changes are user-defined, hence I can't even use some
    kind of smart housekeeping, so event driven won't work for me.

    Cheers,

    Harry O.
     
    Harry Ohlsen, Feb 6, 2004
    #7
  8. Re: Is there any way to mark an object as "always in use" (specifically, in a C extension)?

    At Sat, 07 Feb 2004 07:46:33 +0900 wrote Harry Ohlsen:

    >>It's quite fast. I use it for huge XML documents where
    >>rexml and nqxml are way too slow.
    >>

    > Just out of interest, how large was your "huge". Some of my documents
    > are (literally) hundreds of megabytes.


    I just checked and found it to be about 10 megabytes. So actually not that
    much data. But it was already enough to let rexml run for hours :)

    Regards,
    Ralf.
     
    Ralf Horstmann, Feb 7, 2004
    #8
  9. Re: Is there any way to mark an object as "always in use" (specifically, in a C extension)?


    > >>It's quite fast. I use it for huge XML documents where
    > >>rexml and nqxml are way too slow.
    > >>

    > > Just out of interest, how large was your "huge". Some of my documents
    > > are (literally) hundreds of megabytes.

    >
    > I just checked and found it to be about 10 megabytes. So actually not that
    > much data. But it was already enough to let rexml run for hours :)


    There seems to be a problem/bug/whatever with the current version of
    REXML that makes large files take extra long to process. It reads the
    entire file in before it starts processing, which kills performance. Try
    adding this code to your program:


    module REXML
      class IOSource
        alias_method :_initialize, :initialize

        def initialize(arg, block_size=500)
          @er_source = @source = arg
          @to_utf = false
          @line_break = '>'
          super @source.readline(@line_break)
          @line_break = encode( '>' )
        end
      end
    end

    That seems to fix the problem for other people.

    --
    Zachary P. Landau
    GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc

     
    Zachary P. Landau, Feb 7, 2004
    #9
  10. Harry Ohlsen

    Harry Ohlsen Guest

    Zachary P. Landau wrote:

    >There seems to be a problem/bug/whatever with the current version of
    >REXML that makes large files take extra long to process.
    >

    Is this a new problem introduced in a recent version? If so, it's
    probably not the cause of the slowness I was seeing, because I tried it
    about five or six months ago.

    However, it's definitely worth knowing about that patch for the next
    time I want to do some XML processing, because REXML is just so nice to
    use that it would normally be my first choice!

    Cheers,

    Harry O.
     
    Harry Ohlsen, Feb 8, 2004
    #10
  11. Re: Is there any way to mark an object as "always in use" (specifically, in a C extension)?


    On Sun, Feb 08, 2004 at 09:02:38AM +0900, Harry Ohlsen wrote:
    > Zachary P. Landau wrote:
    >
    > >There seems to be a problem/bug/whatever with the current version of
    > >REXML that makes large files take extra long to process.
    > >

    > Is this a new problem introduced in a recent version? If so, it's
    > probably not the cause of the slowness I was seeing, because I tried it
    > about five or six months ago.
    >
    > However, it's definitely worth knowing about that patch for the next
    > time I want to do some XML processing, because REXML is just so nice to
    > use that it would normally be my first choice!


    The problem came with 1.8.1, so that wouldn't have been the cause.
    Your problem was probably just a huge file :p

    --
    Zachary P. Landau
    GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc

     
    Zachary P. Landau, Feb 8, 2004
    #11