Faster Marshaling?

Discussion in 'Ruby' started by Greg Willits, Jul 27, 2008.

  1. Greg Willits

    Greg Willits Guest

    Exploring options... wondering if there's anything that can replace
    marshaling that's similar in usage (dump & load to/from disk file), but
    faster than the native implementation in Ruby 1.8.6

    I can explain some details if necessary, but in short:

    - I need to marshal,

    - I need to swap data sets often enough that performance
    will be a problem (currently it can take several seconds to restore
    some marshaled data -- way too long)

    - the scaling is such that more RAM per box is costly enough to pay for
    development of a more RAM efficient design

    - faster Marshal performance is worth asking about to see how much
    it'll get me.

    I'm hoping there's something that's as close to a memory space dump &
    restore as possible -- no need to "reconstruct" data piece by piece
    which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
    data file via readlines, and 2 seconds to load a 9MB sized Marshal file,
    so clearly Ruby is busy rebuilding stuff rather than just pumping a RAM
    block with a binary image.

    TIA for any ideas.

    -- gw
    --
    Posted via http://www.ruby-forum.com/.
    Greg Willits, Jul 27, 2008
    #1

  2. Eric Hodel

    Eric Hodel Guest

    On Jul 26, 2008, at 19:58 PM, Greg Willits wrote:
    > Exploring options... wondering if there's anything that can replace
    > marshaling that's similar in usage (dump & load to/from disk file),
    > but
    > faster than the native implementation in Ruby 1.8.6
    >
    > I can explain some details if necessary, but in short:
    >
    > - I need to marshal,
    >
    > - I need to swap data sets often enough that performance
    > will be a problem (currently it can take several seconds to restore
    > some marshaled data -- way too long)
    >
    > - the scaling is such that more RAM per box is costly enough to pay
    > for
    > development of a more RAM efficient design
    >
    > - faster Marshal performance is worth asking about to see how much
    > it'll get me.
    >
    > I'm hoping there's something that's as close to a memory space dump &
    > restore as possible -- no need to "reconstruct" data piece by piece
    > which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
    > data file via readlines, and 2 seconds to load a 9MB sized Marshal
    > file,


    readlines? not read? readlines should be used for text, not binary
    data. Also, supplying an IO to Marshal.load instead of a pre-read
    String adds about 30% overhead for constant calls to getc.
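
    For example, something along these lines (the file name is made up)
    avoids that overhead -- slurp the dump into a String first, then hand
    that to Marshal.load:

      path = "dataset.dump"   # illustrative file name

      # Slower: Marshal.load pulls bytes out of the IO piecemeal via getc.
      obj = File.open(path, "rb") { |f| Marshal.load(f) }

      # Faster: read the whole file into a String, then load from that.
      obj = Marshal.load(File.read(path))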

    9MB seems like a lot of data to load, how many objects are in the
    dump? Do you really need to load a set of objects that large?

    > so clearly Ruby is busy rebuilding stuff rather than just pumping a
    > RAM
    > block with a binary image.


    Ruby is going to need to call allocate for each object in order to
    register with the GC and build the proper object graph. I doubt
    there's a way around this without extensive modification to ruby.
    Eric Hodel, Jul 27, 2008
    #2

  3. On Saturday 26 July 2008 21:58:22 Greg Willits wrote:

    > - I need to swap data sets often enough that performance
    > will be a problem (currently it can take several seconds to restore
    > some marshaled data -- way too long)


    Why do you need to do this yourself?

    > - the scaling is such that more RAM per box is costly enough to pay for
    > development of a more RAM efficient design


    What about more swap per box?

    It might be slower, maybe not, but it seems like the easiest thing to try.

    Another possibility would be to use something like ActiveRecord -- though you
    probably want something much more lightweight (suggestions? I keep forgetting
    what's out there...) -- after all, you probably aren't operating on the whole
    dataset at once, so what you really want is something reasonably fast at
    loading/saving individual objects.
    David Masover, Jul 27, 2008
    #3
  4. Greg Willits

    Greg Willits Guest

    Eric Hodel wrote:
    > On Jul 26, 2008, at 19:58 PM, Greg Willits wrote:
    >> I'm hoping there's something that's as close to a memory space dump &
    >> restore as possible -- no need to "reconstruct" data piece by piece
    >> which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
    >> data file via readlines, and 2 seconds to load a 9MB sized Marshal
    >> file,


    > Ruby is going to need to call allocate for each object in order to
    > register with the GC and build the proper object graph. I doubt
    > there's a way around this without extensive modification to ruby.


    Hmm, makes sense of course; I was just hoping someone had a clever
    replacement.

    I'll just have to try clever code that minimizes the frequency of
    re-loads.

    If you're curious about the back story, I've explained it more below.


    > readlines? not read? readlines should be used for text, not binary
    > data. Also, supplying an IO to Marshal.load instead of a pre-read
    > String adds about 30% overhead for constant calls to getc.


    Wasn't using it on binary data -- was just making a note that an 11MB
    tab file (about 45,000 lines) took all of 250ms (actually 90ms on my
    server drives) to read into an array using readlines, whereas loading a
    marshaled version of that same data (reorganized and saved as an array
    of hashes) from a file that happened to be 9MB took almost 2 seconds --
    so there's clearly a lot of overhead in restoring a marshaled object.
    That was my point.
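
    The comparison was basically just this (file names here stand in for
    the real ones):

      require 'benchmark'

      Benchmark.bm(22) do |x|
        x.report("readlines (tab file)") { File.readlines("source.tab") }
        x.report("Marshal.load")         { Marshal.load(File.read("source.marshal")) }
      end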

    > 9MB seems like a lot of data to load, how many objects are in the
    > dump? Do you really need to load a set of objects that large?


    Yes, and that's not the largest, but it's an average. Range is 1 MB to
    30 MB of raw data per file. A few are 100+ MB, one is 360 MB on its
    own, but it's an exception.

    This is a data aggregation framework. One generic framework will run as
    multiple app-specific instances where each application has a data set of
    4-8GB of raw text data (from 200-400 files). That raw data is loaded,
    reorganized into standardized structures, and one or more indexes
    generated per original file.

    One application instance per server. The server is used as a workgroup
    intranet web server by day (along with its redundant twin), and as an
    aggregator by night.

    That 9MB Marshaled file is the result of one data source of 45,000 lines
    being re-arranged, each data element cleansed and transformed, and then
    stored as an array of hashes. An index is stored as a separate Marshaled
    file so it can be loaded independently.
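
    Roughly like this, where the file names and helpers are made up for
    illustration:

      rows  = cleanse_and_transform(raw_lines)   # hypothetical: array of hashes
      index = build_index(rows)                  # hypothetical: key => row position

      File.open("source_0001.data",  "wb") { |f| f.write Marshal.dump(rows) }
      File.open("source_0001.index", "wb") { |f| f.write Marshal.dump(index) }

      # The index can later be reloaded on its own, without the data file.
      index = Marshal.load(File.read("source_0001.index"))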

    Those 300 or so original files, having been processed and indexed, are
    now searched and combined in a complex aggregation (sadly, not just
    simple mergers) which nets a couple dozen tab files for LOAD DATA into a
    database for the web app.

    Based on a first version of this animal, spikes on faster hardware,
    and accounting for new algorithms and growth in data sizes, this
    process will take several hours on a new Intel server, even with
    everything loaded into RAM. And that's before we start to add a number
    of new tasks to the application.

    Of course, we're looking at ways to split the processing to take
    advantage of multiple cores, but that just adds more demand on memory
    (DRb is way too slow, by a couple orders of magnitude, to consider
    using as a common "memory space" for all cores).

    The aggregation is complex enough that in a perfect world, I'd have the
    entire data set in RAM all at once, because any one final data table
    pulls its data from numerous sources, and alternate sources if the
    primary doesn't have it, on a field-by-field basis. field1 comes from
    sourceX; field2 from sourceA, or from sourceB if A doesn't have it. It
    gets hairy :)

    Unlike a massive web application, where any one transaction can take as
    long as 1 second or even 2 to complete and you throw more machines at
    it to handle the increase in requests, this is a task trying to get
    tens of millions of field transformations and millions of hash reads
    completed linearly as quickly as possible. So, the overhead of DRb and
    similar approaches isn't good enough.


    David Masover wrote:
    >> - I need to swap data sets often enough that performance
    >> will be a problem (currently it can take several seconds to restore
    >> some marshaled data -- way too long)

    >
    > Why do you need to do this yourself?


    As a test, I took that one 9MB sample file mentioned above, and loaded
    it as 6 unique objects to see how long that would take, and how much RAM
    would get used -- Ruby ballooned into using 500MB of RAM. In theory I
    would like to have every one of those 300 files in memory, but
    logistically I can easily get away with 50 to 100 at once. But if Ruby
    is going to balloon that massively, I won't even get close to 50 such
    data sets in RAM at once. So, I "need" to be able to swap data sets in &
    out of RAM as needed (hopefully with an algorithm that minimizes the
    swapping by processing batches which all reference the same loaded data
    sets).
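
    What I have in mind is something like a small cache that holds at most
    N loaded sets and evicts the least recently used one -- just a sketch,
    with made-up names:

      class DataSetCache
        def initialize(max_loaded = 50)
          @max   = max_loaded
          @sets  = {}     # name => loaded object
          @order = []     # least recently used first (1.8 hashes are unordered)
        end

        def fetch(name)
          if @sets.key?(name)
            @order.delete(name)            # re-append below as most recent
          else
            @sets.delete(@order.shift) if @order.size >= @max   # evict oldest
            @sets[name] = Marshal.load(File.read("#{name}.marshal"))
          end
          @order << name
          @sets[name]
        end
      end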


    >> - the scaling is such that more RAM per box is costly enough to pay for
    >> development of a more RAM efficient design

    >
    > What about more swap per box? It might be slower, maybe not, but it seems
    > like the easiest thing to try.


    More "swap"? You mean virtual memory? I may be wrong, but I am assuming
    regardless of how effective VM is, I can easily saturate real RAM, and
    it's been my experience that systems just don't like all of their real
    RAM full.

    Unless there's some Ruby commands to tell it to specifically push
    objects into the OS's VM, I think I am stuck having to manage RAM
    consumption on my own. ??


    > Another possibility would be to use something like ActiveRecord --


    Using the db especially through AR would be glacial. We have a db-based
    process now, and need something faster.

    -- gw

    --
    Posted via http://www.ruby-forum.com/.
    Greg Willits, Jul 27, 2008
    #4
  5. On Sunday 27 July 2008 00:07:10 Greg Willits wrote:

    > >> - the scaling is such that more RAM per box is costly enough to pay for
    > >> development of a more RAM efficient design

    > >
    > > What about more swap per box? It might be slower, maybe not, but it seems
    > > like the easiest thing to try.

    >
    > More "swap"? You mean virtual memory? I may be wrong, but I am assuming
    > regardless of how effective VM is, I can easily saturate real RAM, and
    > it's been my experience that systems just don't like all of their real
    > RAM full.


    In general, yes. However, if this is all the system is doing, I'm
    suggesting that it may be useful -- assuming there isn't something else
    that makes this impractical, like garbage collection pulling everything
    out of RAM to see if it can be collected. (I don't know enough about how
    Ruby garbage collection works to know if this is a problem.)

    But then, given the sheer size problem you mentioned earlier, it probably
    wouldn't work well.

    > > Another possibility would be to use something like ActiveRecord --

    >
    > Using the db especially through AR would be glacial. We have a db-based
    > process now, and need something faster.


    I specifically mean something already designed for this purpose -- not
    necessarily a traditional database. Something like berkdb, or "stone" (I
    think that's what it was called) -- or splitting it into a bunch of files, on
    a decent filesystem.
    David Masover, Jul 27, 2008
    #5
  6. Greg Willits

    Greg Willits Guest

    David Masover wrote:
    > On Sunday 27 July 2008 00:07:10 Greg Willits wrote:


    >> > Another possibility would be to use something like ActiveRecord --

    >>
    >> Using the db especially through AR would be glacial. We have a db-based
    >> process now, and need something faster.

    >
    > I specifically mean something already designed for this purpose -- not
    > necessarily a traditional database. Something like berkdb, or "stone" (I
    > think that's what it was called) -- or splitting it into a bunch of
    > files, on a decent filesystem.


    Berkeley DB has been sucked up by Oracle, and I don't think it ever ran
    on OS X anyway.

    We have talked about skipping Marshaling and going straight to standard
    text files on disk and then using read commands that point to a specific
    file line.

    We haven't spiked that yet, and I don't see it being significantly
    faster than using a local db (especially since db cache might be
    useful), but it's something we'll probably at least investigate just to
    prove its comparative performance. It might be faster just because we
    can keep all indexes in RAM. Get some 15,000 rpm drives, probably
    implement some caching to reduce disk reads.

    So, yeah, maybe that or even sqlite might be suitable if the RAM thing
    just gets too obnoxious to solve. Something that would prove to be
    faster than MySQL.

    -- gw

    --
    Posted via http://www.ruby-forum.com/.
    Greg Willits, Jul 27, 2008
    #6
  7. On Sunday 27 July 2008 00:33:56 Greg Willits wrote:

    > We have talked about skipping Marshaling and going straight to standard
    > text files on disk and then using read commands that point to a specific
    > file line.


    If the files aren't changing, you probably want to seek to a specific byte
    offset in the file, rather than a line -- the latter requires you to read
    through the entire file up to that line.
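
    Roughly (the file name is just an example): record where each line
    starts once, then seek straight to the record you want:

      offsets = []
      File.open("source.tab", "rb") do |f|
        until f.eof?
          offsets << f.pos    # byte offset where this line begins
          f.gets              # skip over the line itself
        end
      end

      # Later: jump straight to record 42_000 without rereading the file.
      line = File.open("source.tab", "rb") { |f| f.seek(offsets[42_000]); f.gets }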

    > We haven't spiked that yet, but I don't see it being significantly
    > faster than using a local db (especially since db cache might be
    > useful),


    More useful than the FS cache?

    > So, yeah, maybe that or even sqlite might be suitable if the RAM thing
    > just gets too obnoxious to solve. Something that would prove to be
    > faster than MySQL.


    For what it's worth, ActiveRecord does work on SQLite. So does Sequel, and I
    bet DataMapper does, too.

    I mentioned BerkDB because I assumed it would be faster than SQLite -- but
    that's a completely uninformed guess.
    David Masover, Jul 27, 2008
    #7
  8. Greg Willits wrote:
    >>> - the scaling is such that more RAM per box is costly enough to pay for
    >>> development of a more RAM efficient design

    >> What about more swap per box? It might be slower, maybe not, but it seems
    >> like the easiest thing to try.

    >
    > More "swap"? You mean virtual memory? I may be wrong, but I am assuming
    > regardless of how effective VM is, I can easily saturate real RAM, and
    > it's been my experience that systems just don't like all of their real
    > RAM full.


    More swap might help, if you assign one ruby process per data set. Then
    switching data sets means just letting the vm swap in a different
    process, if it needs to.
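
    At its simplest that might look something like this (the per-set work
    method is hypothetical); children that aren't actively running get
    paged out to swap instead of your code reloading marshal files:

      pids = Dir.glob("data_sets/*.marshal").map do |path|
        fork do
          data = Marshal.load(File.read(path))
          process_data_set(data)   # hypothetical: work that needs only this set
        end
      end
      pids.each { |pid| Process.wait(pid) }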

    --
    vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
    Joel VanderWerf, Jul 27, 2008
    #8
  9. On 27.07.2008 07:44, David Masover wrote:
    > On Sunday 27 July 2008 00:33:56 Greg Willits wrote:
    >
    >> We have talked about skipping Marshaling and going straight to standard
    >> text files on disk and then using read commands that point to a specific
    >> file line.

    >
    > If the files aren't changing, you probably want to seek to a specific byte
    > offset in the file, rather than a line -- the latter requires you to read
    > through the entire file up to that line.


    Array#pack and String#unpack come to mind. But IMHO this is still
    inferior to using a relational database because in the end it comes down
    to reimplementing the same mechanisms that are present there already.
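
    For example (format and values made up), fixed-width binary records are
    compact and cheap to decode:

      record = [12345, "ACME", 3.14]
      packed = record.pack("NA10G")   # 32-bit int, 10-byte space-padded string, double
      packed.unpack("NA10G")          # => [12345, "ACME", 3.14]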

    > For what it's worth, ActiveRecord does work on SQLite. So does Sequel, and I
    > bet DataMapper does, too.


    But keep in mind that AR and the like introduce some overhead of
    themselves. It might be faster to just use plain SQL to get at the data.

    But given the problem description I would definitely go for a
    relational or other database system. There is no point in reinventing
    the wheel (aka fast indexing of large data volumes on disk) yourself.
    You might even check RAA for an implementation of B-trees.

    Kind regards

    robert
    Robert Klemme, Jul 27, 2008
    #9
  10. On 27.07.2008 12:21, Robert Klemme wrote:

    > But given the problem description I would definitely go for a
    > relational or other database system. There is no point in reinventing
    > the wheel (aka fast indexing of large data volumes on disk) yourself.
    > You might even check RAA for an implementation of B-trees.


    Just after sending I remembered a thread in another newsgroup. The
    problem sounds a bit related to yours and eventually the guy ended up
    using CDB:

    http://cr.yp.to/cdb.html

    There's even a Ruby binding:

    http://raa.ruby-lang.org/project/cdb/

    His summary is here; the problem description is at the beginning of the
    thread:

    http://groups.google.com/group/comp.unix.programmer/msg/420c2cef773f5188

    Kind regards

    robert
    Robert Klemme, Jul 27, 2008
    #10
  11. James Gray

    James Gray Guest

    On Jul 27, 2008, at 12:33 AM, Greg Willits wrote:

    > So, yeah, maybe that or even sqlite might be suitable if the RAM thing
    > just gets too obnoxious to solve.


    I would be shocked if SQLite can't be made to solve the problem well
    with the right planning. That little database is always surprising
    me. Don't forget to look into the following two features as it sounds
    like they may be helpful in this case:

    * In memory databases
    * Attaching multiple SQLite files to perform queries across them
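
    For instance, with the sqlite3 gem (database and table names below are
    made up):

      require 'sqlite3'

      db = SQLite3::Database.new(":memory:")   # in-memory database
      db.execute "ATTACH DATABASE 'source_0001.db' AS src1"
      db.execute "ATTACH DATABASE 'source_0002.db' AS src2"

      rows = db.execute <<-SQL
        SELECT a.key, a.field1, b.field2
          FROM src1.records AS a
          JOIN src2.records AS b ON b.key = a.key
      SQL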

    James Edward Gray II
    James Gray, Jul 27, 2008
    #11
  12. On Sun, Jul 27, 2008 at 11:32:35PM +0900, James Gray wrote:
    > On Jul 27, 2008, at 12:33 AM, Greg Willits wrote:
    >
    >> So, yeah, maybe that or even sqlite might be suitable if the RAM thing
    >> just gets too obnoxious to solve.

    >
    > I would be shocked if SQLite can't be made to solve the problem well with
    > the right planning. That little database is always surprising me. Don't
    > forget to look into the following two features as it sounds like they may
    > be helpful in this case:
    >
    > * In memory databases
    > * Attaching multiple SQLite files to perform queries across them


    Yup, SQLite is a lovely piece of work. If you go that route, you might
    give Amalgalite a try; it embeds sqlite3 inside a ruby extension using
    the SQLite amalgamation source. I'd love to see if it can stand up to
    the demands of your system.

    enjoy,

    -jeremy

    --
    ========================================================================
    Jeremy Hinegardner
    Jeremy Hinegardner, Jul 28, 2008
    #12
  13. Greg Willits wrote:
    > Berkeley DB has been sucked up by Oracle, and I don't think it ever ran
    > on OS X anyway.


    OS X, eh? First, check the MacPorts project (http://macports.org) and
    search for the db44 or db46 ports for your Berkeley DB. Second, even
    though it's in an early development stage, you might want to look into
    MacRuby (http://ruby.macosforge.org). In MacRuby, every Ruby object is
    also a subclass of NSObject, so in theory you should be able to use all
    of the same NSData read/write operations as Cocoa objects. It may well
    be that restoring MacRuby objects written out in this way doesn't
    currently work (haven't had a chance to try it myself...yet), but in
    that case, you could at least file a bug with the project.

    --
    Posted via http://www.ruby-forum.com/.
    Joshua Ballanco, Jul 28, 2008
    #13