implementing an array based class


bwv549

Some of the data I deal with involves hundreds of thousands of records,
where each record has a very predictable number of attributes. By going
from a hash-based object to an array-based object I am able to
substantially decrease memory usage. However, the array-based object
I have made (see below) is really lame for two reasons:

1. It is a pain to make a different array based class
2. When I call 'flatten' on an array of these objects, my objects get
flattened, too (how do I prevent this?)

I realize that I could just use arrays of arrays for all the objects,
but for ease of programming I'd like to be able to access an object's
attributes by name. Does anyone know of some memory efficient and
fast way of doing this (that's insensitive to 'flatten')?

I've tried several times to create some kind of metaprogramming
version of this but always fail.

This works (as far as what I'm basically after), but it's very ugly:

## (wrapped in a class inheriting from Array for completeness; the class
## name is assumed -- the original snippet showed only the class body)
class MyRecord < Array

  ind_keys = {} ; ind_keys_w_eq = {} ; @@ind = {}

  ## The method name to index mapping
  ind_keys = { :shape => 0, :intensity => 1, :flare => 2, :alpha => 3,
               :id => 4, :xposition => 5, :yposition => 6, :zposition => 7,
               :color => 8, :parent => 9 }

  @@arr_size = ind_keys.size

  ## Make a single class hash that maps getters and setters
  ind_keys.each {|k,v| ind_keys_w_eq["#{k}=".to_sym] = v }
  ind_keys.merge!(ind_keys_w_eq)
  ind_keys.each {|k,v| @@ind[k] = v ; @@ind["#{k}"] = v }

  ## Manually define getters and setters
  def shape     ; self[0] end ; def shape=(oth)     ; self[0] = oth end
  def intensity ; self[1] end ; def intensity=(oth) ; self[1] = oth end
  def flare     ; self[2] end ; def flare=(oth)     ; self[2] = oth end
  def alpha     ; self[3] end ; def alpha=(oth)     ; self[3] = oth end
  def id        ; self[4] end ; def id=(oth)        ; self[4] = oth end
  def xposition ; self[5] end ; def xposition=(oth) ; self[5] = oth end
  def yposition ; self[6] end ; def yposition=(oth) ; self[6] = oth end
  def zposition ; self[7] end ; def zposition=(oth) ; self[7] = oth end
  def color     ; self[8] end ; def color=(oth)     ; self[8] = oth end
  def parent    ; self[9] end ; def parent=(oth)    ; self[9] = oth end
  # The number of total proteins sharing this color
  def num_tot_proteins ; self[10] end ; def num_tot_proteins=(oth) ; self[10] = oth end

  ## create an array of the exact object size or set from a hash
  def initialize(args=nil)
    super(@@arr_size)   # was @@arr_size.size, which returns the Integer's byte size
    if args
      if args.is_a? Hash
        args.each do |k,v|
          self[@@ind[k]] = v
        end
      end
    end
  end
end

Ideas and suggestions are most welcome. Thanks.
 

Alex Gutteridge

Some of the data I deal with involves hundreds of thousands of records
where each is very predictable in the number of attributes. By going
from a hash based object to an array based object I am able to
substantially decrease memory usage. However, the array based object
I have made (see below), is really lame for two reasons:

1. It is a pain to make a different array based class
2. When I call 'flatten' on an array of these objects, my objects get
flattened, too (how do I prevent this?)

I realize that I could just use arrays of arrays for all the objects,
but for ease of programming I'd like to be able to access an object's
attributes by name. Does anyone know of some memory efficient and
fast way of doing this (that's insensitive to 'flatten')?
snip

Ideas and suggestions are most welcome. Thanks.

I don't know about speed/memory use, but perhaps Struct (in stdlib)
or Arrayfields:

http://raa.ruby-lang.org/project/arrayfields/

does what you want?

Alex Gutteridge

Bioinformatics Center
Kyoto University
 

bwv549

I don't know about speed/memory use, but perhaps Struct (in stdlib)
or Arrayfields:

http://raa.ruby-lang.org/project/arrayfields/

does what you want?

Alex Gutteridge

Bioinformatics Center
Kyoto University

Thanks for the suggestions. I've looked at both as options and took
the time to do some profiling (speed and memory). [ See the SUMMARY
at the end if you want the short answer ]
It's really only pseudo-scientific, but it's better than where I was
at before.

Memory profiling turns out to be kind of a pain to get going. I
created a little plugin to do this based on http://t-a-w.blogspot.com/2007/02/ruby-and-judy.html

module MemoryUsage

  MY_START_MEMORY = {}

  def mem_start
    file = IO.read("/proc/#{Process.pid}/smaps")
    MY_START_MEMORY['size'] = _size(file)
    MY_START_MEMORY['rss']  = _rss(file)
  end

  def mem_stop
    file = IO.read("/proc/#{Process.pid}/smaps")
    size_dif = _size(file) - MY_START_MEMORY['size']
    rss_dif  = _rss(file)  - MY_START_MEMORY['rss']
    "SIZE: #{size_dif} kB, RSS: #{rss_dif} kB"
  end

  def _size(file_string)
    file_string.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
  end

  def _rss(file_string)
    file_string.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
  end
end

class Object
  include MemoryUsage
end

## then you simply require 'memory_usage'
## and put 'mem_start' in your code at the beginning
## and 'puts mem_stop' in your code at the end
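
For example, a minimal session might look like this (the workload line is
just a placeholder, and the module reads /proc, so it assumes Linux):

require 'memory_usage'

mem_start
records = Array.new(500_000) { Array.new(6) }   # allocate whatever you want to measure
puts mem_stop   # e.g. "SIZE: 12345 kB, RSS: 12340 kB"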

I will summarize the results. A struct is really nice and easy to
use:

MyClass = Struct.new(:att1, :att2, :att3)
myobject = MyClass.new(value1, value2, value3)
myobject.att1 = a_different_value # can get/set by method
puts myobject[0] ## prints the current value of att1 (can get/set by array index)

I made an array of about 500,000 arrays (6 values each). Then I tested
each style of object for speed of object creation, access by keyword, and
access by index (where applicable).

## MyObject (this is my object)
class MyObject < Array

  def a ; self[0] ; end
  def b ; self[1] ; end
  def c ; self[2] ; end
  def d ; self[3] ; end
  def e ; self[4] ; end
  def f ; self[5] ; end
  def g ; self[6] ; end

end

## STRUCT CLASS
MyStructClass = Struct.new( *$fields )
# MyStructClass.new( *atts ) ## -> to create object

## NORMAL OBJECT
class NormalObject
  attr_accessor( *$fields )
  def initialize(*args)
    (@a, @b, @c, @d, @e, @f, @g) = args
  end
end
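
The access-timing harness itself isn't shown above; a rough sketch of what
it might look like follows (REP is a placeholder, and it assumes
$fields = [:a, :b, :c, :d, :e, :f, :g] was set before the class
definitions above):

require 'benchmark'

values = (1..7).to_a
REP    = 500_000

my_obj = MyObject.new(7)                        # 7 nil slots
values.each_with_index { |v, i| my_obj[i] = v }
struct_obj = MyStructClass.new(*values)
normal_obj = NormalObject.new(*values)

Benchmark.bmbm(18) do |x|
  x.report("index:   MyObject") { REP.times { my_obj[3] } }
  x.report("index:   Struct")   { REP.times { struct_obj[3] } }
  x.report("keyword: MyObject") { REP.times { my_obj.d } }
  x.report("keyword: Struct")   { REP.times { struct_obj.d } }
  x.report("keyword: Normal")   { REP.times { normal_obj.d } }
end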


Access by index:
MyObject, Structs, and Arrays all have equivalent access times by
index.

Access by keywords:
Structs have essentially the same access time by keyword as by index
(blazing fast), and it's about the same as for a normal object. MyObject
takes about twice as long to access by keyword.

Initialization:
Arrays initialize fastest, followed by MyObject. Struct takes about twice
as long to initialize as MyObject (I think because you have to splat the
fields in as an argument list). Normal objects take the longest to
initialize (even though I'm setting all the attributes in a single call).

MEMORY:
Arrays take about 20 times less memory than structs or MyObject.
MyObject uses a little less memory than a struct, but it depends on
how the initialization is implemented. Normal objects (in this
experiment, mileage will vary I think) took a little more than twice
as much memory as Structs or MyObject.

I didn't test ArrayFields (even though it looks like a very useful
class) based on my understanding that each array that is created must
then be set with the fields that it will use (which seems like it
would be too much overhead for what I'm doing, although I could be
wrong).

*** SUMMARY *** (for situations with gazillions of fixed-length objects):
1. Don't use objects if you can help it; use arrays. Faster and way leaner.
2. If you have to use an object, Struct seems like a good compromise:
simple to use, fast, extendable, fast access by keyword, a little slow on
object creation. Significantly more memory than an array, but much less
than a normal object.
3. The ugly MyObject above is probably best if you want faster
initialization and plan to access mostly by index. However, there's no
easy way to create such a class. This is where someone with some
metaprogramming magic might be able to really clean things up...
4. Avoid normal objects. They take a long time to initialize and
take up a lot of memory.
 

Robert Klemme

*** SUMMARY *** (for situations with gazillions of fixed-length objects):
1. Don't use objects if you can help it; use arrays. Faster and way leaner.
2. If you have to use an object, Struct seems like a good compromise:
simple to use, fast, extendable, fast access by keyword, a little slow on
object creation. Significantly more memory than an array, but much less
than a normal object.
3. The ugly MyObject above is probably best if you want faster
initialization and plan to access mostly by index. However, there's no
easy way to create such a class. This is where someone with some
metaprogramming magic might be able to really clean things up...
4. Avoid normal objects. They take a long time to initialize and
take up a lot of memory.

I am not sure I agree with your assessment of performance. A class is
actually the fastest if you omit the initialize definition.

Also, I am not sure whether your memory measurements are accurate: as
far as I can see, you invoke neither GC.start (in order to try to force a
GC, which is unreliable anyway) nor GC.disable (in order to stop GC
completely and thus be able to measure allocation).
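
A minimal sketch of what a GC-aware run might look like, reusing the
/proc-based MemoryUsage helper posted earlier (the workload line is just a
placeholder):

require 'memory_usage'

GC.disable          # stop the collector so raw allocation is measured
mem_start
records = Array.new(500_000) { [1, 2, 3, 4, 5, 6, 7] }   # placeholder workload
puts mem_stop       # SIZE/RSS delta with no GC interference
GC.enable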

Kind regards

robert


09:58:18 [Temp]: ./create.rb
Rehearsal --------------------------------------------------
Struct 3.281000 0.000000 3.281000 ( 3.360000)
Class 0.703000 0.000000 0.703000 ( 0.840000)
Class (init) 6.375000 0.000000 6.375000 ( 7.962000)
Class (init2) 7.203000 0.000000 7.203000 ( 8.807000)
Array 1.016000 0.000000 1.016000 ( 1.351000)
Array.new 0.812000 0.000000 0.812000 ( 0.881000)
Array.new(5) 1.516000 0.000000 1.516000 ( 1.890000)
OpenStruct 4.672000 0.000000 4.672000 ( 5.969000)
---------------------------------------- total: 25.578000sec

user system total real
Struct 2.687000 0.000000 2.687000 ( 3.359000)
Class 0.797000 0.000000 0.797000 ( 0.845000)
Class (init) 6.687000 0.000000 6.687000 ( 7.929000)
Class (init2) 7.016000 0.000000 7.016000 ( 8.815000)
Array 1.125000 0.000000 1.125000 ( 1.347000)
Array.new 0.781000 0.000000 0.781000 ( 0.875000)
Array.new(5) 1.547000 0.000000 1.547000 ( 1.879000)
OpenStruct 4.750000 0.000000 4.750000 ( 5.985000)
09:59:31 [Temp]: cat create.rb
#!/usr/bin/ruby

require 'ostruct'
require 'benchmark'
include Benchmark

St1 = Struct.new :f1, :f2, :f3, :f4, :f5

class St2; attr_accessor :f1, :f2, :f3, :f4, :f5; end

class St3
  attr_accessor :f1, :f2, :f3, :f4, :f5

  def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil)
    @f1=f1
    @f2=f2
    @f3=f3
    @f4=f4
    @f5=f5
  end
end

class St4
  attr_accessor :f1, :f2, :f3, :f4, :f5

  def initialize(*a)
    @f1, @f2, @f3, @f4, @f5 = a
  end
end

REP = 1_000_000

bmbm(15) do |re|
  re.report("Struct")        { REP.times { St1.new } }
  re.report("Class")         { REP.times { St2.new } }
  re.report("Class (init)")  { REP.times { St3.new } }
  re.report("Class (init2)") { REP.times { St4.new } }
  re.report("Array")         { REP.times { [] } }
  re.report("Array.new")     { REP.times { Array.new } }
  re.report("Array.new(5)")  { REP.times { Array.new(5) } }
  re.report("OpenStruct")    { REP.times { OpenStruct.new } }
end
09:59:33 [Temp]:
 

bwv549

Thanks for the pointers. I'd love to know how to do proper memory
benchmarking. It does not seem trivial, I guess.

Regarding object initialization, I guess I'm really interested in
total time to initialize the object and fill in all attributes with
values (as this is what I'm going to be facing before I can use any of
the objects). I've incorporated this aspect into a similar benchmark
for comparison:

Rehearsal --------------------------------------------------
Class1 5.666667 0.533333 6.200000 ( 3.730552)
Class2 5.983333 0.516667 6.500000 ( 3.904695)
Struct 3.950000 0.250000 4.200000 ( 2.518688)
ArrayBased 2.033333 0.350000 2.383333 ( 1.431213)
Array.new 2.100000 0.316667 2.416667 ( 1.443011)
Array [] 0.833333 0.266667 1.100000 ( 0.671233)
---------------------------------------- total: 22.800000sec

user system total real
Class1 5.550000 0.600000 6.150000 ( 3.686555)
Class2 5.716667 0.666667 6.383333 ( 3.834078)
Struct 3.883333 0.316667 4.200000 ( 2.527558)
ArrayBased 2.166667 0.266667 2.433333 ( 1.445458)
Array.new 2.150000 0.300000 2.450000 ( 1.474190)
Array [] 0.900000 0.233333 1.133333 ( 0.684567)

----------------------------------------------------

#!/usr/bin/ruby

require 'benchmark'
include Benchmark

attribute_array = (1...7).to_a

class Class1
  def initialize(attribute_array)
    (@f1, @f2, @f3, @f4, @f5, @f6, @f7) = attribute_array
  end
end

class Class2
  def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil, f6=nil, f7=nil)
    @f1=f1
    @f2=f2
    @f3=f3
    @f4=f4
    @f5=f5
    @f6=f6
    @f7=f7
  end
end


StructClass = Struct.new(:f1, :f2, :f3, :f4, :f5, :f6, :f7)
filled_in = StructClass.new( *attribute_array )

class ArrayBased < Array
end

filled_in = ArrayBased.new(attribute_array)


REP = 1_000_000

bmbm(15) do |re|
  re.report("Class1")     { REP.times { Class1.new(attribute_array) } }
  re.report("Class2")     { REP.times { Class2.new(*attribute_array) } }
  re.report("Struct")     { REP.times { StructClass.new(*attribute_array) } }
  re.report("ArrayBased") { REP.times { ArrayBased.new(attribute_array) } }
  re.report("Array.new")  { REP.times { Array.new(attribute_array) } }
  re.report("Array []")   { REP.times { [*attribute_array] } }
end

----------------------------------------------------

For speed, at least, my initialization observations appear to hold (in
this benchmark, anyway).
Ordered from fastest to slowest filled-object creation: Array [],
ArrayBased & Array.new, Struct, Class.

Kind regards,
john
 

ara.t.howard

MEMORY:
Arrays take about 20 times less memory than structs or MyObject.
MyObject uses a little less memory than a struct, but it depends on
how the initialization is implemented. Normal objects (in this
experiment, mileage will vary I think) took a little more than twice
as much memory as Structs or MyObject.

I didn't test ArrayFields (even though it looks like a very useful
class) based on my understanding that each array that is created must
then be set with the fields that it will use (which seems like it
would be too much overhead for what I'm doing, although I could be
wrong).

arrayfields was specifically designed for the case you're coding. in particular
it was designed to provide fast keyword access and low memory overhead - i was
using it with huge postgresql result sets which were arrays. as you point
out, arrays use far less memory than any other solution. they are also much
nicer to work with

csv = tuple.join ','

for example. anyhow, the @fields object is not copied for each array - only a
reference is held. so if one does this

fields = %w( a b c )

huge_number_tuples.each{|tuple| tuple.fields = fields}

then ALL tuples share the fields object - a copy is not taken. anyhow, i'd
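
As a small illustration of the keyword access this buys (my own example,
assuming the usual arrayfields string-key lookup):

fields = %w( a b c )
tuple  = [1, 2, 3]
tuple.fields = fields
tuple['a']   # => 1, while tuple is still a plain (lean) Array underneath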
love to see your tests using it.

regards.


-a
 

bwv549

i'd love to see your tests using it.

Thanks for the insight. I've included an arrayfields implementation
on the above benchmark:

Rehearsal --------------------------------------------------
ArrayFields 8.433333 0.800000 9.233333 ( 5.585626)
Class1 5.566667 0.500000 6.066667 ( 3.631757)
Class2 6.050000 0.550000 6.600000 ( 3.963679)
Struct 3.583333 0.300000 3.883333 ( 2.335911)
ArrayBased 2.033333 0.266667 2.300000 ( 1.381698)
Array.new 1.950000 0.333333 2.283333 ( 1.371351)
Array [] 0.833333 0.316667 1.150000 ( 0.689010)
---------------------------------------- total: 31.516667sec

user system total real
ArrayFields 8.516667 0.816667 9.333333 ( 5.609708)
Class1 5.366667 0.650000 6.016667 ( 3.617349)
Class2 5.933333 0.566667 6.500000 ( 3.893614)
Struct 3.616667 0.266667 3.883333 ( 2.360849)
ArrayBased 1.933333 0.316667 2.250000 ( 1.374079)
Array.new 1.883333 0.333333 2.216667 ( 1.370410)
Array [] 0.866667 0.283333 1.150000 ( 0.696223)


Here's how I tested it (let me know if this is done incorrectly):

require 'arrayfields'
fields = %w(f1 f2 f3 f4 f5 f6 f7)

bmbm(15) do |re|
  re.report("ArrayFields") { REP.times { x = [*attribute_array]; x.fields = fields } }
  ## .... same as before
end

I haven't tested the memory on it, but it would appear to be
comparable to an Array (very low memory usage) from your
implementation. It does have slightly higher overhead in the time
required for creation (if I did it correctly). Nonetheless,
arrayfields seems ideal for database work as you can simply label the
fields without creating new classes for everything. In the particular
case before me, I have only a handful of object types, but lots of
them.

Thanks,
john
 

bwv549

Based on the model I was using above for an array-based object
(inheriting from Array), and similar to the way Struct works to
initialize it, I've worked up this guy (with some effort):


class ArrayClass
  def self.create(fields)
    new_class = Class.new(Array)
    new_class.class_eval('undef_method :flatten')
    new_class.class_eval('undef_method :flatten!')
    new_class.class_eval("@@keyword_to_index = {}")
    fields.each_with_index do |field,i|
      field_sym = field.to_sym
      ## THE GETTERS AND SETTERS:
      new_class.class_eval("def #{field}() self[#{i}] end")
      new_class.class_eval("def #{field}=(val) self[#{i}]=val end")

      new_class.class_eval("@@keyword_to_index[:#{field}] = #{i}")
    end
    new_class.class_eval('def self.indices() @@keyword_to_index end')
    new_class.class_eval('def indices() @@keyword_to_index end')
    new_class
  end
end
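
A roughly equivalent builder can also be written with define_method blocks
instead of string class_eval. This is only a sketch against modern Ruby,
with a made-up module name so it doesn't clash with ArrayClass above:

module ArrayClassAlt
  def self.create(fields)
    keyword_to_index = {}
    fields.each_with_index { |f, i| keyword_to_index[f.to_sym] = i }

    Class.new(Array) do
      # disable flatten/flatten! on instances, as in the class_eval version above
      undef_method :flatten
      undef_method :flatten!

      keyword_to_index.each do |field, i|
        define_method(field)       { self[i] }               # getter
        define_method("#{field}=") { |val| self[i] = val }   # setter
      end

      define_method(:indices)           { keyword_to_index }
      define_singleton_method(:indices) { keyword_to_index }
    end
  end
end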

## 'flatten' wreaks havoc on objects that inherit from Array. I've tried
## in different ways to make the above object insensitive to flatten, with
## no real luck. The redefinition below will break code that relies on
## flattening other Array-derived objects, but it lets us call flatten on
## arrays of our objects:

class Array
  ## WARNING! I've redefined Array#flatten to only work on Array objects
  #  (not derived objects)
  def flatten
    new_array = []
    self.each do |v|
      if v.class == Array
        new_array.push( *(v.flatten) )
      else
        new_array << v
      end
    end
    new_array
  end

  ## WARNING! I've redefined Array#flatten! to only work on Array objects
  #  (not derived objects)
  def flatten!
    self.replace(flatten)
  end
end

## USAGE:
MyClass = ArrayClass.create( %w(f1 f2 f3 f4 f5) )
obj = MyClass.new( %w(v1 v2 v3 v4 v5) )
puts obj.f1
obj.f2 = 'something_else'
MyClass.indices # -> hash giving index based on keyword.
obj.indices # -> same
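
For example (values are just placeholders), flatten on a container now
leaves these objects alone:

row  = MyClass.new( %w(v1 v2 v3 v4 v5) )
data = [ [row, row], [1, [2, 3]] ]
data.flatten # -> [row, row, 1, 2, 3]  (the MyClass instances survive intact)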

I've written some tests and it seems to fly.
 

ara.t.howard

i'd love to see your tests using it.

Thanks for the insight. I've included an arrayfields implementation
on the above benchmark:

Rehearsal --------------------------------------------------
ArrayFields 8.433333 0.800000 9.233333 ( 5.585626)
Class1 5.566667 0.500000 6.066667 ( 3.631757)
Class2 6.050000 0.550000 6.600000 ( 3.963679)
Struct 3.583333 0.300000 3.883333 ( 2.335911)
ArrayBased 2.033333 0.266667 2.300000 ( 1.381698)
Array.new 1.950000 0.333333 2.283333 ( 1.371351)
Array [] 0.833333 0.316667 1.150000 ( 0.689010)
---------------------------------------- total: 31.516667sec

user system total real
ArrayFields 8.516667 0.816667 9.333333 ( 5.609708)
Class1 5.366667 0.650000 6.016667 ( 3.617349)
Class2 5.933333 0.566667 6.500000 ( 3.893614)
Struct 3.616667 0.266667 3.883333 ( 2.360849)
ArrayBased 1.933333 0.316667 2.250000 ( 1.374079)
Array.new 1.883333 0.333333 2.216667 ( 1.370410)
Array [] 0.866667 0.283333 1.150000 ( 0.696223)


Here's how I tested it (let me know if this is done incorrectly):

require 'arrayfields'
fields = %w(f1 f2 f3 f4 f5 f6 f7)

bmbm(15) do |re|
  re.report("ArrayFields") { REP.times { x = [*attribute_array]; x.fields = fields } }
  ## .... same as before
end

I haven't tested the memory on it, but it would appear to be comparable to
an Array (very low memory usage) from your implementation. It does have
slightly higher overhead in the time required for creation (if I did it
correctly). Nonetheless, arrayfields seems ideal for database work as you
can simply label the fields without creating new classes for everything. In
the particular case before me, I have only a handful of object types, but
lots of them.

seems like you're really comparing apples to oranges here. for instance your
Class1 constructor only ever takes a reference to 'attribute_array', while some
ctors take copies, or even several. i made a mod so all calls to
attribute_array create a new array, so you actually measure the speed of the
ctor and not the speed of copying vs not copying an array (we know which one is
faster!). and not all your implementations provide keyword access - but i
guess you know that...

also i used the new (version 3.7.0) arrayfields interface to build a class
that way. here are the results


harp:~ > cat a.rb
#! /usr/bin/env ruby

require 'rubygems'
require 'arrayfields' # 3.7.0!
require 'benchmark'
include Benchmark

class Class1
  def initialize(attribute_array)
    (@f1, @f2, @f3, @f4, @f5, @f6, @f7) = attribute_array
  end
end

class Class2
  def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil, f6=nil, f7=nil)
    @f1=f1
    @f2=f2
    @f3=f3
    @f4=f4
    @f5=f5
    @f6=f6
    @f7=f7
  end
end

StructClass = Struct.new(:f1, :f2, :f3, :f4, :f5, :f6, :f7)

class ArrayBased < Array
end

=begin
#
# this is all Array.fields does
#
class ArrayFieldsBased < Array
  FIELDS = ArrayFields::FieldSet.new [:f1, :f2, :f3, :f4, :f5, :f6, :f7]
  include ArrayFields
  def initialize *a, &b
    super
  ensure
    @fieldset = FIELDS
  end
end
=end

ArrayFieldsBased = Array.fields :f1, :f2, :f3, :f4, :f5, :f6, :f7

def attribute_array() (1..7).to_a end

REP = 1_000_00 #_000

bmbm(15) do |re|
  re.report("ArrayFieldsBased") { REP.times { ArrayFieldsBased.new attribute_array } }
  re.report("Class1")           { REP.times { Class1.new attribute_array } }
  re.report("Class2")           { REP.times { Class2.new *attribute_array } }
  re.report("Struct")           { REP.times { StructClass.new *attribute_array } }
  re.report("ArrayBased")       { REP.times { ArrayBased.new attribute_array } }
  re.report("Array.new")        { REP.times { Array.new attribute_array } }
  re.report("Array []")         { REP.times { [*attribute_array] } }
end


and here's the results



[ahoward@localhost ~]$ ruby a.rb
Rehearsal ----------------------------------------------------
ArrayFieldsBased 1.440000 0.010000 1.450000 ( 1.709076)
Class1 1.550000 0.020000 1.570000 ( 6.274899)
Class2 1.680000 0.000000 1.680000 ( 1.945875)
Struct 1.230000 0.010000 1.240000 ( 1.486386)
ArrayBased 1.000000 0.000000 1.000000 ( 1.309320)
Array.new 0.980000 0.000000 0.980000 ( 1.178536)
Array [] 0.750000 0.000000 0.750000 ( 0.935874)
------------------------------------------- total: 8.670000sec

user system total real
ArrayFieldsBased 1.560000 0.000000 1.560000 ( 1.787502)
Class1 1.400000 0.010000 1.410000 ( 1.627891)
Class2 1.480000 0.000000 1.480000 ( 1.742301)
Struct 1.180000 0.010000 1.190000 ( 1.299734)
ArrayBased 0.970000 0.000000 0.970000 ( 1.050021)
Array.new 0.980000 0.000000 0.980000 ( 1.089647)
Array [] 0.740000 0.010000 0.750000 ( 0.888502)



so, now you see what you should - all approaches are within the same order of
magnitude: there is no difference.


regards.

-a
 

Robert Klemme

so, now you see what you should - all approaches are within the same
order of magnitude: there is no difference.

Well, a factor of two between slowest and fastest. Not dramatic, but it may
be noticeable when dealing with large quantities of objects.

But I also (?) have this "premature optimization" feeling about this
thread. And we haven't seen real memory measurements yet, which seemed
to be the root concern...

Kind regards

robert
 
