handling large data sets

Discussion in 'Ruby' started by Martin Pirker, Dec 8, 2003.

  1. Hi...

    short version:
    What's your preferred way to handle large data sets used by Ruby scripts?

    long version:
    Some time ago I had a problem where Ruby's text-munging capabilities
    helped very much. Unfortunately, the data set was quite large: an array
    of ~40000 entries, where every entry is itself an array of ~8 data
    points (String, Int, ... whatever).
    All in all the final solution ran in ~2 minutes and iterated several
    times over the whole data set -> a brute force approach.
    What surprised me: a first "more efficient" implementation, which took
    "notes" in temp structures while munching and therefore needed fewer
    full iterations, was much slower.
    Now I have to do such a thing again, but the data set will be in the
    upper 6-digit count.

    Brute force again?
    Is the GC the significant speed problem with temp data?
    Is the GC suited for xxxMb data at all?
    Is "mixing in" an external database maybe faster?
    ....

    Is there a reference for non-source-divers on how Ruby's basic
    data structures (Array, Hash, String, ...) perform for basic
    ops - insert, delete, append, search, ...?
    I could test them one by one, but if I know some things run in O(n^2)
    I won't even try.

    Any experiences to tell? :)

    Martin
     
    Martin Pirker, Dec 8, 2003
    #1

  2. On Monday, December 8, 2003, 5:42:14 PM, Martin wrote:

    > Hi...


    > short version:
    > What's your preferred way to handle large data sets used by Ruby scripts?


    > [...]


    > Any experiences to tell? :)


    Martin,

    I can't answer the specific technical questions, but the one
    experience I had with using Ruby to process large amounts of data
    went something like this. There were heaps of data in one set
    of files, and heaps of data in another set, and I had to cross-match
    the datasets to prepare data for insertion into a database.

    But I needed to run it regularly on growing, rather than different,
    files. I can't remember what was particularly expensive, but I saved
    a lot of time by caching pre-calculated results in a Marshal dump, so
    that each time the program was run, it could work out what data it did
    and didn't need to examine.

    Marshalling the data was slow in 1.6 and fast in 1.8.

    It sounds like your data is an order of magnitude larger than mine,
    and probably a different problem, but I'm sure some use of Marshal, or
    YAML, can help.

    It's probably worth experimenting with loading a 10,000-line file into
    internal data structures vs. loading the pre-saved data structures.
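
    For what it's worth, here is a minimal sketch of the kind of caching I
    mean (the file names and the parse step are just placeholders, not from
    the real program):

    CACHE = 'parsed.cache'

    # stands in for whatever expensive per-record work you do
    def parse_line(line)
      line.chomp.split("\t")
    end

    def load_data(path)
      # reuse the marshalled cache as long as the source file hasn't changed
      stamp = [File.mtime(path).to_i, File.size(path)]
      if File.exist?(CACHE)
        cached_stamp, data = File.open(CACHE, 'rb') { |f| Marshal.load(f) }
        return data if cached_stamp == stamp
      end
      data = File.readlines(path).map { |line| parse_line(line) }
      File.open(CACHE, 'wb') { |f| Marshal.dump([stamp, data], f) }
      data
    end

    records = load_data('big_input.txt')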

    Cheers,
    Gavin
     
    Gavin Sinclair, Dec 8, 2003
    #2

  3. "Martin Pirker" <> schrieb im Newsbeitrag
    news:3fd41d05$0$30786$...
    > Hi...
    >
    > short version:
    > What's your preferred way to handle large data sets used by Ruby scripts?
    >
    > long version:
    > Some time ago I had a problem where Ruby's text-munging capabilities
    > helped very much. Unfortunately, the data set was quite large: an array
    > of ~40000 entries, where every entry is itself an array of ~8 data
    > points (String, Int, ... whatever).
    > All in all the final solution ran in ~2 minutes and iterated several
    > times over the whole data set -> a brute force approach.
    > What surprised me: a first "more efficient" implementation, which took
    > "notes" in temp structures while munching and therefore needed fewer
    > full iterations, was much slower.
    > Now I have to do such a thing again, but the data set will be in the
    > upper 6-digit count.
    >
    > Brute force again?
    > Is the GC the significant speed problem with temp data?
    > Is the GC suited for xxxMb data at all?
    > Is "mixing in" an external database maybe faster?
    > ...
    >
    > Is there a reference for non-source-divers on how Ruby's basic
    > data structures (Array, Hash, String, ...) perform for basic
    > ops - insert, delete, append, search, ...?
    > I could test them one by one, but if I know some things run in O(n^2)
    > I won't even try.
    >
    > Any experiences to tell? :)


    Freezing Strings before using them as Hash keys helps, because a Hash
    normally dups Strings that are used as keys.
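
    A small illustrative check (nothing specific to your data):

    h = {}
    plain  = "plain key"
    frozen = "frozen key".freeze

    h[plain]  = 1
    h[frozen] = 2

    h.keys.any? { |k| k.equal?(plain) }    # => false, the Hash stored a dup
    h.keys.any? { |k| k.equal?(frozen) }   # => true, the frozen key was reused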

    Avoid duplicating data in the "notes".

    It's difficult to get more specific since we don't know the data and types
    at hand.

    Kind regards

    robert
     
    Robert Klemme, Dec 8, 2003
    #3
  4. And lo Martin wrote (Mon, 8 Dec 2003 15:42:14 +0900):

    > short version:
    > What's your preferred way to handle large data sets used by Ruby
    > scripts?
    >
    > Is "mixing in" an external database maybe faster?


    If you're working with a massive number of records, but not editing each
    of them in place all the time (or even if you are - it's highly
    dependent), have you considered using SQL? External databases are
    designed for managing data as large as yours, so it's worth considering
    as an option, especially since Ruby has pretty nice support for SQL in my
    experience (at least for MySQL).
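
    Just as an illustration (connection details, table and columns are made
    up, and this assumes the plain mysql bindings are installed), it looks
    roughly like:

    require 'mysql'

    db = Mysql.real_connect('localhost', 'user', 'password', 'logdata')

    # let the database do the grouping/matching where it can
    res = db.query('SELECT host, COUNT(*) FROM hits GROUP BY host')
    res.each do |host, count|
      puts "#{host}: #{count}"
    end

    db.close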
     
    Gregory Millam, Dec 8, 2003
    #4
  5. how large is the data?
    what do you do with the data?

    We looked at some logfiles (~100 gigabytes; can't remember the exact size)
    and I noted that a single pass over the data using a Ruby script was much
    faster than using MySQL.

    probably obvious:
    Only load into memory what you absolutely need.
    Experiment with the chunk sizes you read in.
    Turn off the GC temporarily (if feasible) - see the sketch below.
    Write a few routines in C (SWIG rules).
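
    A rough sketch of the chunk-size and GC points together (the chunk size,
    file name, and processing step are just placeholders):

    CHUNK = 1 << 20          # read 1 MB at a time - tune this

    GC.disable               # only if the working set stays bounded
    File.open('big.log', 'rb') do |f|
      while chunk = f.read(CHUNK)
        # process the chunk here instead of keeping it around
      end
    end
    GC.enable
    GC.start                 # clean up once, after the tight loop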

    good luck,
    -A
     
    Armin Roehrl, Dec 8, 2003
    #5
  6. On 8 Dec 2003, Martin Pirker wrote:

    > short version: What's your preferred way to handle large data sets used by
    > Ruby scripts?


    rbtree (a red-black tree).

    it has an interface like a hash but is _always_ sorted. the sort order is
    determined by the keys' '<=>' method. it also allows lower_bound and
    upper_bound searches. it marshals to disk quite fast. O(log n) for insert,
    delete, and search.

    > long version: Some time ago I had a problem where Ruby's text-munging
    > capabilities helped very much. Unfortunately, the data set was quite
    > large: an array of ~40000 entries, where every entry is itself an array
    > of ~8 data points (String, Int, ... whatever). All in all the final
    > solution ran in ~2 minutes and iterated several times over the whole
    > data set -> a brute force approach.


    i use rbtree to store database 'tables' where

    pk -> tuple

    of 80 elements and more. it is extremely fast.

    > What surprised me: a first "more efficient" implementation, which took
    > "notes" in temp structures while munching and therefore needed fewer
    > full iterations, was much slower.


    function call overhead is probably killing you. simply scanning memory is
    fast.

    > Now I have to do such a thing again, but the data set will be in the
    > upper 6-digit count.


    like i said - i've done this.

    > Brute force again? Is the GC the significant speed problem with temp data?
    > Is the GC suited for xxxMb data at all? Is "mixing in" an external database
    > maybe faster?


    berkeley db might help - since it uses memory pools. guy's impl. is very
    complete.


    > Is there a reference for non-source-divers on how Ruby's basic data
    > structures (Array, Hash, String, ...) perform for basic ops - insert,
    > delete, append, search, ...? I could test them one by one, but if I know
    > some things run in O(n^2) I won't even try.


    i think you need to test for your type of data. for instance, i had some
    data for which it was faster to do custom marshaling rather than rely on
    rbtree's own. this was because the data was delimited text, and doing a
    'join on delimiter, marshal as string - load from string, split on
    delimiter' was faster than going over the entire rbtree
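
    roughly like this (delimiters and file name are arbitrary placeholders):

    DELIM  = "\x1f"   # field separator
    RECSEP = "\n"     # record separator

    # dump: join each tuple into one delimited line and write one big string
    def dump_tuples(tuples, path)
      File.open(path, 'wb') do |f|
        f.write(tuples.map { |t| t.join(DELIM) }.join(RECSEP))
      end
    end

    # load: read the string back and split it apart again
    def load_tuples(path)
      File.read(path).split(RECSEP).map { |line| line.split(DELIM) }
    end

    tuples = [['alice', '42', '3.14'], ['bob', '7', '2.71']]
    dump_tuples(tuples, 'tuples.dat')
    p load_tuples('tuples.dat')   # note: everything comes back as Strings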

    > Any experiences to tell? :)


    my system 'caches' the result of a database query in an rbtree stored within a
    pstore in the webroot. queries are between 10,000 and 100,000 lines. the
    only speed hit is sending large pages back to the client - no probs with data
    speed.
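
    a stripped-down version of that caching idea (names and the fake query
    result are made up):

    require 'pstore'
    require 'rbtree'

    store = PStore.new('query_cache.pstore')
    rows  = [[1, 'alpha'], [2, 'beta']]   # stand-in for the real query result

    # save the result keyed by primary key
    store.transaction do
      rb = RBTree.new
      rows.each { |row| rb[row.first] = row }
      store['latest_query'] = rb
    end

    # later, read it back without re-running the query
    cached = store.transaction(true) { store['latest_query'] }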

    here are the results for an rbtree of around 65000 entries:

    ~/eg/ruby/rbtree > ruby large.rb
    insert:
    min: 0.00000298
    max: 0.02559197
    avg: 0.00001013

    search:
    min: 0.00000298
    max: 0.04867399
    avg: 0.00001075

    marshal:
    min: 0.82507992
    max: 0.82507992
    avg: 0.82507992

    load:
    min: 1.04620802
    max: 1.04620802
    avg: 1.04620802

    delete:
    min: 0.00000703
    max: 0.15283799
    avg: 0.00000857


    here is the code:

    ~/eg/ruby/rbtree > cat large.rb
    require 'rbtree'

    n  = 1 << 16          # 65536 entries
    rb = RBTree.new

    # insert n entries
    insert = Stat.new :insert
    n.times do |k|
      tuple = [k, Time.now.to_f, rand]
      insert.time { rb[k] = tuple }
    end
    p insert

    # search for all n entries
    search = Stat.new :search
    n.times do |k|
      search.time { tuple = rb[k] }
    end
    p search

    # dump all n entries
    marshal = Stat.new :marshal
    dumped = marshal.time { Marshal.dump rb }
    p marshal

    # load them back up
    mload = Stat.new :load
    rb = mload.time { Marshal.load dumped }
    p mload

    # delete all n entries
    delete = Stat.new :delete
    n.times do |k|
      delete.time { rb.delete k }
    end
    p delete


    # simple min/max/running-average timer; the BEGIN block runs before the
    # main body, so Stat is already defined when it is used above
    BEGIN {
      class Stat
        def initialize name
          @name = name
          @min, @max, @avg = nil, nil, nil
        end
        def update! t
          @min = t if @min.nil? or t < @min
          @max = t if @max.nil? or t > @max
          @avg = (@avg.nil? ? t : (@avg + t)/2)
        end
        def time
          a = Time.now.to_f
          r = yield
          b = Time.now.to_f
          update!(b - a)
          r
        end
        def inspect
          format "%s:\n\tmin: %8.8f\n\tmax: %8.8f\n\tavg: %8.8f\n",
            @name, @min, @max, @avg
        end
      end
    }



    cheers.

    -a
    --

    ATTN: please update your address books with address below!

    ===============================================================================
    | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
    | PHONE :: 303.497.6469
    | ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
    | STP :: http://www.ngdc.noaa.gov/stp/
    | NGDC :: http://www.ngdc.noaa.gov/
    | NESDIS :: http://www.nesdis.noaa.gov/
    | NOAA :: http://www.noaa.gov/
    | US DOC :: http://www.commerce.gov/
    |
    | The difference between art and science is that science is what we
    | understand well enough to explain to a computer.
    | Art is everything else.
    | -- Donald Knuth, "Discover"
    |
    | /bin/sh -c 'for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done'
    ===============================================================================
     
    Ara.T.Howard, Dec 8, 2003
    #6
  7. On Monday, 8 December 2003 at 22:20:00 +0900, Armin Roehrl wrote:
    > how large is the data?
    > what do you do with the data?
    >
    > We looked at some logfiles (~ 100 gigabytes; can't remember exact size)


    How do you read a file larger than 4 GB?


    --
    Jim Freeze
    ----------
    "I have a very firm grasp on reality! I can reach out and strangle it
    any time!"
     
    Jim Freeze, Dec 9, 2003
    #7
  8. Ara.T.Howard <> wrote:
    > On 8 Dec 2003, Martin Pirker wrote:
    >> short version: What's your preferred way to handle large data sets used by
    >> Ruby scripts?

    >
    > rbtree (a red-black tree).
    >
    > it has an interface like a hash but is _always_ sorted. the sort order is
    > determined by the keys' '<=>' method. it also allows lower_bound and
    > upper_bound searches. it marshals to disk quite fast. O(log n) for insert,
    > delete, and search.


    very interesting - a way to hold a large working set of data, ordered and
    with fast ops

    looking at the README of rbtree, the extra methods "lower_bound" and
    "upper_bound" (and others?) provided by rbtree don't seem to be
    mentioned - is there any documentation explaining them, or should I guess
    from the test code?


    > simply scanning memory is fast.


    with numerical values I agree

    however, with mixed data, taking notes during previous iterations "feels"
    cheaper to me

    this approaches the border between "fit the data to the language's data
    structures" and "structure the data to best match the problem domain" -
    I'm torn
    as a Ruby newbie I'm still in the transition to Ruby thinking, so for me
    it's mostly tapping in the dark and banging my head sometimes ;-)


    thanks for your suggestions

    Martin
     
    Martin Pirker, Dec 9, 2003
    #8
  9. Martin Pirker wrote:
    > looking at the README of rbtree, the extra methods "lower_bound,
    > upper_bound" (and others?) provided by rbtree don't seem to be
    > mentioned - is there a docu explaining them or guess from the test code?


    irb(main):002:0> t = RBTree.new
    => #<RBTree: {}, default=nil, compare=nil>
    irb(main):003:0> t[1] = "one"
    => "one"
    irb(main):004:0> t[4] = "four"
    => "four"
    irb(main):005:0> t[7] = "seven"
    => "seven"
    irb(main):006:0> t.lower_bound(3)
    => [4, "four"]
    irb(main):007:0> t.upper_bound(3)
    => [1, "one"]

    So my guess is that lower_bound(x) returns the entry with the smallest key
    >= x (the "greatest lower bound" of the keys at or above x), and
    upper_bound(x) the entry with the largest key <= x.

    http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?greatest lower bound
     
    Joel VanderWerf, Dec 9, 2003
    #9
  10. "Martin Pirker" <> schrieb im Newsbeitrag
    news:3fd5672f$0$11742$...
    > > simply scanning memory is fast.

    >
    > with numerical values I agree
    >
    > however, with mixed data, taking notes during previous iterations "feels"
    > cheaper to me


    Maybe it helps if you give more information about the nature of the
    problem you are trying to solve.

    Regards

    robert
     
    Robert Klemme, Dec 9, 2003
    #10
  11. In message <>, Jim Freeze writes:
    >On Monday, 8 December 2003 at 22:20:00 +0900, Armin Roehrl wrote:
    >> how large is the data?
    >> what do you do with the data?
    >>
    >> We looked at some logfiles (~ 100 gigabytes; can't remember exact size)

    >
    >How do you read a file larger than 4 GB?
    >

    Like the elephant? One byte at a time :)
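
    More seriously: as long as your ruby build has large file support, you
    don't need the whole file in memory at once - stream it (file name made
    up):

    count = 0
    File.open('huge.log', 'rb') do |f|
      f.each_line do |line|
        # handle one line at a time; memory use stays flat
        count += 1
      end
    end
    puts count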



    --
    Ian - posting to a Newsgroup. Please remove everything to reply.
     
    Ian Hobson, Dec 9, 2003
    #11
