[SUMMARY] Records and Arrays (#170)


Matthew Moss

It is worthwhile to note that, for most real situations, this week's
quiz problem isn't necessary. From a human perspective, we naturally
organize things in aggregate, as structured records. Whether it be
object-oriented data, rows in a SQL table, XML records or other,
that's how we naturally organize things. Except for a few specialized
conditions, collections of records are sufficient for our
organizational needs as either programmers or users.

What are those specialized conditions? I can think of a couple situations:

1. *Performance.* When dealing with older hardware, or perhaps
specialized hardware (e.g. vector units on some processors or GPUs),
it can be quite beneficial to "de-organize" data. Tight, little loops
can then quickly process individual bits of data, and not be required
to either process nor hold the whole data record.

2. *Interface.* If the data must be provided to another library or
component that expects parallel bits of data (e.g. a graphing
component as suggested by the quiz), then obviously we need to provide
the data as requested.

In the quiz, I asked that first for a data moving solution. This is
often appropriate for performance situations, since an adapter does
not provide the compact data that performance may require. (For
example, some vector units may have limited memory, so it's important
to extract only the data needed for vector processing.)

The extra-credit adapter (wrapper) solution is more appropriate for
interface situations. Here, especially with Ruby, we can easily wrap a
new interface around the old, and make one set of data work in
numerous situations.

Let's start by looking at a solution that moves data into a new
"record of arrays" structure, submitted by _Andrea Fazzi_.

def aor_to_roa(arr)
hash = { }
arr.each do |record|
record.marshal_dump.each_key { |field|
(hash[field] ||= []) << record.send(field)

This function accepts arrays of `OpenStruct` instances, as specified
by the quiz. The outer loop enumerates over each record. For each
record, `marshal_dump` is called: this is an instance method of
`OpenStruct` and generates a hash consisting of field names associated
to the field values.

For each key in that hash, Andrea first ensures that the new hash is
either already initialized, or gets initialized to an empty array. The
old record is sent the field name as a message (which returns the
field value), which is appended to the new hash's array for that
field. Finally, a new `OpenStruct` instance is returned, which takes
the new hash as its content.

(On a side note... Technically, for vector units and similar
situations, you'd need to move the *actual* data into a memory block,
rather than mere copying of object references as done above... but the
code is sufficient for displaying the basic idea.)

I'm going to pass on talking about the opposite function: that is,
moving data from a "record of arrays" to an "array of records." Given
that we usually store data as an "array of records," this is less
important since we can refer to the original data. Still, take a look
at Andrea's `roa_to_aor` method, which maintains a similar approach
and simplicity.

I'm also not going to talk about pushing updates back to the original
data for this case. Generally, this isn't desireable for the typical
cases (i.e. performance) that move/duplicate data. Either the data
isn't being modified, or if being modified, it gets pushed out the end
of the pipe and *intentionally* lost. (Had I considered this when
writing the quiz, I would have skipped my request for the `roa_to_aor`

Situations that focus on interface, however, will prefer to work with
the original data. In fact, it's important that the adapter allow the
original data to be modified to the extent it was originally
modifiable: read-only data should not suddenly become read-write.
(Another thing I should have put into the original quiz description...

First, I want to look at the adapter from _James Gray_, which is quite
simple and satifies the quiz description as was written. (I've left
out the method `__enum__`, since it only existed to support

class MapToEnum
instance_methods.each { |meth|
undef_method(meth) unless meth =~ /\A__/

def initialize(enum)
@enum = enum

def method_missing(meth, *args, &block)
@enum.map { |o| o.send(meth, *args, &block) }

The first line undefines almost all the instance methods of this class
(excepting those that start with two underscores), and does so
immediately. To see what this leaves behind, try the following in irb:
class WhatIsLeft
instance_methods.each { |m| undef_method(m) unless m =~ /\A__/ }

=> ["__id__", "__send__"]

Basically, the intent here is that `MapToEnum` do very little on its
own: most work will be passed onto the data, which we'll see in a
moment. Handly little thing... at least when using the class, but
better to comment it out while debugging it.

The `MapToEnum` initializer simply remembers the array passed in,
which is then used in `method_missing`. Here, any methods called
except those above are sent to each of the objects in `@enum`. Using
`@enum.map` means that arrays will be returned. And because all
instance methods were cleared out first, you can do some other things
besides getting the field names.

# Assuming people is an array of Person structs...
peeps = MapToEnum.new(people)
=> ["James", "Dana", "Matthew"]
=> "Matthew"
=> [Person, Person, Person]

What doesn't work with James' solution is modifying the data.
peeps.name[2] = "Bob"
=> "Bob"
=> ["James", "Dana", "Matthew"]
people.map { |x| x.name }
=> ["James", "Dana", "Matthew"]

This is because the attempt to modify `peeps.name[2]` actually changes
the reference contained in the third slot of the array generated by
`peeps.name`. To change the source data, `peeps.name[2]` would need to
represent the `name` field of the original record.

Let's look at one more solution, this one from _Frederick Cheung_.

class ArrayOfRecordsWrapper
def initialize(array)
@array = array

def method_missing(name, *args)
FieldProxy.new(@array, name)

class FieldProxy
instance_methods.each { |m| undef_method m unless m =~ /(^__)/ }

def initialize array, name
@array, @name = array, name

def [](index)
@array[index].send @name

def []=(index,value)
@array[index].send(@name.to_s + '=', value)

def to_a
@field_array = @array.collect {|a| a.send @name}

def method_missing(*args, &block)
to_a.send *args, &block

Frederick's solution is, in many ways, similar to James' solution,
though Frederick adds in a proxy object to manage named fields. In
fact, if you remove the square bracket methods `[]` and `[]=` in the
proxy, simplify and refactor, you'd essentially have James' solution.

Yet, Frederick's solution _does_ work for manipulating the data.
=> "Matthew"
peeps = ArrayOfRecordsWrapper.new(people)
=> ["James", "Dana", "Matthew"]
peeps.name[2] = "Bob"
=> "Bob"
=> ["James", "Dana", "Bob"]
=> "Bob"

This works specifically because Frederick provided the square bracket
methods, which manipulate the original data directly, rather than
through the array generated from `to_a` and `method_missing`. If we
attempt to change the original data through those (rather than the
proxy supplied square bracket methods), it won't work.
=> ["James", "Dana", "Bob"]
peeps.name.to_a[2] = "Matthew"
=> "Matthew"
=> ["James", "Dana", "Bob"] # not changed!

It might be worthwhile to hide `to_a`, though I expect that's not the
only way to circumvent the goal of modifying the original data. (And
with a dynamic language such as Ruby, I'm pretty sure of it.) If we
set a goal of defining and supporting common usage, then Frederick's
solution is a pretty satisfactory one.


Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question