handling large data sets

Discussion in 'Ruby' started by Martin Pirker, Dec 8, 2003.

  1. Hi...

    short version:
    What's your preferred way to handle large data sets used by Ruby scripts?

    long version:
    Some time ago I had a problem where Ruby's text-munging capabilities
    helped very much. Unfortunately, the data set was quite large: an array
    of ~40000 entries, where every entry is itself an array of ~8 data
    points (String, Int, ... whatever).
    All in all the final solution ran in ~2 minutes and iterated several
    times over the whole data set -> a brute force approach.
    What surprised me: a first "more efficient" implementation, which took
    "notes" in temp structures while munching and therefore needed fewer
    full iterations, was much slower.
    Now I have to do such a thing again, but the data set will be in the
    upper 6-digit count.

    Brute force again?
    Is the GC the significant speed problem with temp data?
    Is the GC suited for xxxMb data at all?
    Is "mixing in" an external database maybe faster?
    ....

    Is there a reference for non-source-divers on how Ruby's basic
    data structures (Array, Hash, String, ...) perform for basic
    ops - insert, delete, append, search, ...?
    I could test them one by one, but if I know some things run in O(n^2)
    I won't even try.

    Any experiences to tell? :)

    Martin
     
    Martin Pirker, Dec 8, 2003
    #1

  2. On Monday, December 8, 2003, 5:42:14 PM, Martin wrote:

    > Hi...


    > short version:
    > What's your preferred way to handle large data sets used by Ruby scripts?


    > [...]


    > Any experiences to tell? :)


    Martin,

    I can't answer the specific technical questions, but the one
    experience I had with using Ruby to process large amounts of data
    went something like this. There were heaps of data in one set
    of files, and heaps of data in another set, and I had to cross-match
    the datasets to prepare data for insertion into a database.

    But I needed to run it regularly on growing, rather than different,
    files. I can't remember what was particularly expensive, but I saved
    a lot of time by caching pre-calculated results in a Marshal dump, so
    that each time the program was run, it could work out what data it did
    and didn't need to examine.

    Marshalling the data was slow in 1.6 and fast in 1.8.

    It sounds like your data is an order of magnitude larger than mine,
    and probably a different problem, but I'm sure some use of Marshal, or
    YAML, can help.

    It's probably worth experimenting with loading a 10,000-line file into
    internal data structures vs. loading the pre-saved data structures.
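
    For what it's worth, here is a minimal sketch of the kind of caching I
    mean (the file names and the parse step are just placeholders, not from
    the real program):

    CACHE = 'parsed.cache'

    # stands in for whatever expensive per-record work you do
    def parse_line(line)
      line.chomp.split("\t")
    end

    def load_data(path)
      # reuse the marshalled cache as long as the source file hasn't changed
      stamp = [File.mtime(path).to_i, File.size(path)]
      if File.exist?(CACHE)
        cached_stamp, data = File.open(CACHE, 'rb') { |f| Marshal.load(f) }
        return data if cached_stamp == stamp
      end
      data = File.readlines(path).map { |line| parse_line(line) }
      File.open(CACHE, 'wb') { |f| Marshal.dump([stamp, data], f) }
      data
    end

    records = load_data('big_input.txt')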

    Cheers,
    Gavin
     
    Gavin Sinclair, Dec 8, 2003
    #2

  3. "Martin Pirker" <> schrieb im Newsbeitrag
    news:3fd41d05$0$30786$...
    > Hi...
    >
    > short version:
    > What's your preferred way to handle large data sets used by Ruby scripts?
    >
    > long version:
    > Some time ago I had a problem where Ruby's text-munging capabilities
    > helped very much. Unfortunately, the data set was quite large: an array
    > of ~40000 entries, where every entry is itself an array of ~8 data
    > points (String, Int, ... whatever).
    > All in all the final solution ran in ~2 minutes and iterated several
    > times over the whole data set -> a brute force approach.
    > What surprised me: a first "more efficient" implementation, which took
    > "notes" in temp structures while munching and therefore needed fewer
    > full iterations, was much slower.
    > Now I have to do such a thing again, but the data set will be in the
    > upper 6-digit count.
    >
    > Brute force again?
    > Is the GC the significant speed problem with temp data?
    > Is the GC suited for xxxMb data at all?
    > Is "mixing in" an external database maybe faster?
    > ...
    >
    > Is there a reference for non-source-divers on how Ruby's basic
    > data structures (Array, Hash, String, ...) perform for basic
    > ops - insert, delete, append, search, ...?
    > I could test them one by one, but if I know some things run in O(n^2)
    > I won't even try.
    >
    > Any experiences to tell? :)


    Freezing Strings before using them as Hash keys helps, because a Hash
    normally dups Strings that are used as keys.
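
    A small illustrative check (nothing specific to your data):

    h = {}
    plain  = "plain key"
    frozen = "frozen key".freeze

    h[plain]  = 1
    h[frozen] = 2

    h.keys.any? { |k| k.equal?(plain) }    # => false, the Hash stored a dup
    h.keys.any? { |k| k.equal?(frozen) }   # => true, the frozen key was reused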

    Avoid duplicating data in the "notes".

    It's difficult to get more specific since we don't know the data and types
    at hand.

    Kind regards

    robert
     
    Robert Klemme, Dec 8, 2003
    #3
  4. And lo Martin wrote (Mon, 8 Dec 2003 15:42:14 +0900):

    > short version:
    > What's your preferred way to handle large data sets used by Ruby
    > scripts?
    >
    > Is "mixing in" an external database maybe faster?


    If you're working with a massive number of records, but not editing each
    of them in place all the time (or even if you are - it's highly
    dependent), have you considered using SQL? External databases are
    designed for managing data as large as yours, so it's worth considering
    as an option, especially since Ruby has pretty nice support for SQL in my
    experience (at least for MySQL).
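
    Just as an illustration (connection details, table and columns are made
    up, and this assumes the plain mysql bindings are installed), it looks
    roughly like:

    require 'mysql'

    db = Mysql.real_connect('localhost', 'user', 'password', 'logdata')

    # let the database do the grouping/matching where it can
    res = db.query('SELECT host, COUNT(*) FROM hits GROUP BY host')
    res.each do |host, count|
      puts "#{host}: #{count}"
    end

    db.close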
     
    Gregory Millam, Dec 8, 2003
    #4
  5. how large is the data?
    what do you do with the data?

    We looked at some logfiles (~100 gigabytes; can't remember the exact size)
    and I noted that a single pass over the data using a Ruby script was much
    faster than using MySQL.

    probably obvious:
    Only load into memory what you absolutely need.
    Experiment with the chunk sizes you read in.
    Turn off the GC temporarily (if feasible) - see the sketch below.
    Write a few routines in C (SWIG rules).
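
    A rough sketch of the chunk-size and GC points together (the chunk size,
    file name, and processing step are just placeholders):

    CHUNK = 1 << 20          # read 1 MB at a time - tune this

    GC.disable               # only if the working set stays bounded
    File.open('big.log', 'rb') do |f|
      while chunk = f.read(CHUNK)
        # process the chunk here instead of keeping it around
      end
    end
    GC.enable
    GC.start                 # clean up once, after the tight loop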

    good luck,
    -A
     
    Armin Roehrl, Dec 8, 2003
    #5
  6. On 8 Dec 2003, Martin Pirker wrote:

    > short version: What's your preferred way to handle large data sets used by
    > Ruby scripts?


    rbtree (a red-black tree).

    it has an interface like a hash but is _always_ sorted. the sort order is
    determined by the keys' '<=>' method. it also allows lower_bound and
    upper_bound searches. it marshals to disk quite fast. O(log n) for insert,
    delete, and search.

    > long version: Some time ago I had a problem where Ruby's text-munging
    > capabilities helped very much. Unfortunately, the data set was quite
    > large: an array of ~40000 entries, where every entry is itself an array
    > of ~8 data points (String, Int, ... whatever). All in all the final
    > solution ran in ~2 minutes and iterated several times over the whole
    > data set -> a brute force approach.


    i use rbtree to store database 'tables' where

    pk -> tuple

    of 80 elements and more. it is extremely fast.

    > What surprised me: a first "more efficient" implementation, which took
    > "notes" in temp structures while munching and therefore needed fewer
    > full iterations, was much slower.


    function call overhead is probably killing you. simply scanning memory is
    fast.

    > Now I have to do such a thing again, but the data set will be in the
    > upper 6-digit count.


    like i said - i've done this.

    > Brute force again? Is the GC the significant speed problem with temp data?
    > Is the GC suited for xxxMb data at all? Is "mixing in" an external database
    > maybe faster?


    berkeley db might help - since it uses memory pools. guy's impl. is very
    complete.


    > Is there a reference for non-source-divers on how Ruby's basic data
    > structures (Array, Hash, String, ...) perform for basic ops - insert,
    > delete, append, search, ...? I could test them one by one, but if I know
    > some things run in O(n^2) I won't even try.


    i think you need to test for your type of data. for instance, i had some
    data for which it was faster to do custom marshaling rather than rely on
    rbtree's own. this was because the data was delimited text, and doing a
    'join on delimiter, marshal as string - load from string, split on
    delimiter' was faster than going over the entire rbtree
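
    roughly like this (delimiters and file name are arbitrary placeholders):

    DELIM  = "\x1f"   # field separator
    RECSEP = "\n"     # record separator

    # dump: join each tuple into one delimited line and write one big string
    def dump_tuples(tuples, path)
      File.open(path, 'wb') do |f|
        f.write(tuples.map { |t| t.join(DELIM) }.join(RECSEP))
      end
    end

    # load: read the string back and split it apart again
    def load_tuples(path)
      File.read(path).split(RECSEP).map { |line| line.split(DELIM) }
    end

    tuples = [['alice', '42', '3.14'], ['bob', '7', '2.71']]
    dump_tuples(tuples, 'tuples.dat')
    p load_tuples('tuples.dat')   # note: everything comes back as Strings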

    > Any experiences to tell? :)


    my system 'caches' the result of a database query in an rbtree stored within a
    pstore in the webroot. queries are between 10,000 and 100,000 lines. the
    only speed hit is sending large pages back to the client - no probs with data
    speed.
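
    a stripped-down version of that caching idea (names and the fake query
    result are made up):

    require 'pstore'
    require 'rbtree'

    store = PStore.new('query_cache.pstore')
    rows  = [[1, 'alpha'], [2, 'beta']]   # stand-in for the real query result

    # save the result keyed by primary key
    store.transaction do
      rb = RBTree.new
      rows.each { |row| rb[row.first] = row }
      store['latest_query'] = rb
    end

    # later, read it back without re-running the query
    cached = store.transaction(true) { store['latest_query'] }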

    here are the results for an rbtree of around 65000 entries:

    ~/eg/ruby/rbtree > ruby large.rb
    insert:
    min: 0.00000298
    max: 0.02559197
    avg: 0.00001013

    search:
    min: 0.00000298
    max: 0.04867399
    avg: 0.00001075

    marshal:
    min: 0.82507992
    max: 0.82507992
    avg: 0.82507992

    load:
    min: 1.04620802
    max: 1.04620802
    avg: 1.04620802

    delete:
    min: 0.00000703
    max: 0.15283799
    avg: 0.00000857


    here is the code:

    ~/eg/ruby/rbtree > cat large.rb
    require 'rbtree'

    n  = 1 << 16          # 65536 entries
    rb = RBTree.new

    # insert n entries
    insert = Stat.new :insert
    n.times do |k|
      tuple = [k, Time.now.to_f, rand]
      insert.time { rb[k] = tuple }
    end
    p insert

    # search for all n entries
    search = Stat.new :search
    n.times do |k|
      search.time { tuple = rb[k] }
    end
    p search

    # dump all n entries
    marshal = Stat.new :marshal
    dumped = marshal.time { Marshal.dump rb }
    p marshal

    # load them back up
    mload = Stat.new :load
    rb = mload.time { Marshal.load dumped }
    p mload

    # delete all n entries
    delete = Stat.new :delete
    n.times do |k|
      delete.time { rb.delete k }
    end
    p delete


    # simple min/max/running-average timer; the BEGIN block runs before the
    # main body, so Stat is already defined when it is used above
    BEGIN {
      class Stat
        def initialize name
          @name = name
          @min, @max, @avg = nil, nil, nil
        end
        def update! t
          @min = t if @min.nil? or t < @min
          @max = t if @max.nil? or t > @max
          @avg = (@avg.nil? ? t : (@avg + t)/2)
        end
        def time
          a = Time.now.to_f
          r = yield
          b = Time.now.to_f
          update!(b - a)
          r
        end
        def inspect
          format "%s:\n\tmin: %8.8f\n\tmax: %8.8f\n\tavg: %8.8f\n",
            @name, @min, @max, @avg
        end
      end
    }



    cheers.

    -a
    --

    ATTN: please update your address books with address below!

    ===============================================================================
    | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
    | PHONE :: 303.497.6469
    | ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
    | STP :: http://www.ngdc.noaa.gov/stp/
    | NGDC :: http://www.ngdc.noaa.gov/
    | NESDIS :: http://www.nesdis.noaa.gov/
    | NOAA :: http://www.noaa.gov/
    | US DOC :: http://www.commerce.gov/
    |
    | The difference between art and science is that science is what we
    | understand well enough to explain to a computer.
    | Art is everything else.
    | -- Donald Knuth, "Discover"
    |
    | /bin/sh -c 'for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done'
    ===============================================================================
     
    Ara.T.Howard, Dec 8, 2003
    #6
  7. On Monday, 8 December 2003 at 22:20:00 +0900, Armin Roehrl wrote:
    > how large is the data?
    > what do you do with the data?
    >
    > We looked at some logfiles (~ 100 gigabytes; can't remember exact size)


    How do you read a file larger than 4 GB?


    --
    Jim Freeze
    ----------
    "I have a very firm grasp on reality! I can reach out and strangle it
    any time!"
     
    Jim Freeze, Dec 9, 2003
    #7
  8. Ara.T.Howard <> wrote:
    > On 8 Dec 2003, Martin Pirker wrote:
    >> short version: What's your preferred way to handle large data sets used by
    >> Ruby scripts?

    >
    > rbtree (a red-black tree).
    >
    > it has an interface like a hash but is _always_ sorted. the sort order is
    > determined by the keys' '<=>' method. it also allows lower_bound and
    > upper_bound searches. it marshals to disk quite fast. O(log n) for insert,
    > delete, and search.


    very interesting - a way to hold a large working set of data, ordered and
    with fast ops

    looking at the README of rbtree, the extra methods "lower_bound" and
    "upper_bound" (and others?) provided by rbtree don't seem to be
    mentioned - is there any documentation explaining them, or should I guess
    from the test code?


    > simply scanning memory is fast.


    with numerical values I agree

    however, with mixed data, taking notes during previous iterations "feels"
    cheaper to me

    this approaches the border between "fit the data to the language's data
    structures" and "structure the data to best match the problem domain" -
    I'm torn
    as a Ruby newbie I'm still in the transition to Ruby thinking, so for me
    it's mostly tapping in the dark and banging my head sometimes ;-)


    thanks for your suggestions

    Martin
     
    Martin Pirker, Dec 9, 2003
    #8
  9. Martin Pirker wrote:
    > looking at the README of rbtree, the extra methods "lower_bound,
    > upper_bound" (and others?) provided by rbtree don't seem to be
    > mentioned - is there a docu explaining them or guess from the test code?


    irb(main):002:0> t = RBTree.new
    => #<RBTree: {}, default=nil, compare=nil>
    irb(main):003:0> t[1] = "one"
    => "one"
    irb(main):004:0> t[4] = "four"
    => "four"
    irb(main):005:0> t[7] = "seven"
    => "seven"
    irb(main):006:0> t.lower_bound(3)
    => [4, "four"]
    irb(main):007:0> t.upper_bound(3)
    => [1, "one"]

    So my guess is that lower_bound(x) returns the entry with the smallest key
    >= x (the "greatest lower bound" of the keys at or above x), and
    upper_bound(x) the entry with the largest key <= x.

    http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?greatest lower bound
     
    Joel VanderWerf, Dec 9, 2003
    #9
  10. "Martin Pirker" <> schrieb im Newsbeitrag
    news:3fd5672f$0$11742$...
    > > simply scanning memory is fast.

    >
    > with numerical values I agree
    >
    > however, with mixed data, taking notes during previous iterations "feels"
    > cheaper to me


    Maybe it helps if you give more information about the nature of the
    problem you are trying to solve.

    Regards

    robert
     
    Robert Klemme, Dec 9, 2003
    #10
  11. In message <>, Jim Freeze writes:
    >On Monday, 8 December 2003 at 22:20:00 +0900, Armin Roehrl wrote:
    >> how large is the data?
    >> what do you do with the data?
    >>
    >> We looked at some logfiles (~ 100 gigabytes; can't remember exact size)

    >
    >How do you read a file larger than 4 GB?
    >

    Like the elephant? One byte at a time :)
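
    More seriously: as long as your ruby build has large file support, you
    don't need the whole file in memory at once - stream it (file name made
    up):

    count = 0
    File.open('huge.log', 'rb') do |f|
      f.each_line do |line|
        # handle one line at a time; memory use stays flat
        count += 1
      end
    end
    puts count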



    --
    Ian - posting to a Newsgroup. Please remove everything to reply.
     
    Ian Hobson, Dec 9, 2003
    #11
