Usage patterns and trusted libraries

David Rush · Jul 26, 2007

Hi y'all,

This may end up being regarded as an incendiary posting, but it's not
meant to be. This is just an observation from a relative Ruby (in
general) and Rails (in particular) newb.

So I'm beavering away at my lovely little start-up desk and really
rather enjoying Ruby (in between the moments of utter frustration)
and I start coding up some ETL processes to load and merge masses of
data into my bouncing baby web-system. And all is relatively good
until I get to my first tricky merge process where I have to
disambiguate names and otherwise harmonize my various data sources.

The process takes over 12 hours to run using ActiveRecord to provide
my DB access. For 5500 records.

I tweak DB indices. I get out CachedModel. I read a lot of code and
eat heaps of metaprogrammed object spaghetti. I run benchmarks. And I
finally conclude that my access patterns are totally defeating the
metaprogramming and requiring excessive DB traffic- even though I
can't really prove it.

So eventually I rewrote the program using a different language (which
I don't mention to avoid starting a flame-war - I could have used Ruby
on top of the MySQL interfaces) with a cache strategy that is better
suited to the DB access pattern.

The new process takes 55 *seconds*.

The moral of the story: there isn't one really. If I was 100% sure
that I wouldn't need to re-run the data (either because of
undiscovered bugs or b0rk3n data from the vendor) the 12 hour run
would have been an efficient use of my time, probably. But performance
does matter when it has an impact on the amount of time I have to
spend waiting for critical-path processes. If there is a moral, it is
simply: know your tools. And that community excitement doesn't
substitute for good documentation.

Anyway, I am still happy with both Ruby and Rails. But this was a
lovely opportunity to re-learn a lesson I've learned too many times
before.

david rush

Rob Biedenharn · Jul 26, 2007

Hi y'all,

This may end up being regarded as an incendiary posting, but it's not
meant to be. This is just an observation from a relative Ruby (in
general) and Rails (in particular) newb.

So I'm beavering away at my lovely little start-up desk and really
rather enjoying Ruby (in between the moments of utter frustration)
and I start coding up some ETL processes to load and merge masses of
data into my bouncing baby web-system. And all is relatively good
until I get to my first tricky merge process where I have to
disambiguate names and otherwise harmonize my various data sources.

The process takes over 12 hours to run using ActiveRecord to provide
my DB access. For 5500 records.

david rush

Although it may be too late, might I suggest that

ActiveWarehouse ETL

could be a good place to (re-)start?

The Rubyforge site for it is:
http://rubyforge.org/frs/?group_id=2435

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)

James Britt · Jul 27, 2007

David said:
Hi y'all,
Hi!

So eventually I rewrote the program using a different language (which
I don't mention to avoid starting a flame-war - I could have used Ruby
on top of the MySQL interfaces) with a cache strategy that is better
suited to the DB access pattern.

The new process takes 55 *seconds*.

So, the question is, why did you pick ActiveRecord in the first place,
and could there have been a better vetting process to select the most
appropriate DB tool?

The moral of the story: there isn't one really. If I was 100% sure
that I wouldn't need to re-run the data (either because of
undiscovered bugs or b0rk3n data from the vendor) the 12 hour run
would have been an efficient use of my time, probably. But performance
does matter when it has an impact on the amount of time I have to
spend waiting for critical-path processes. If there is a moral, it is
simply: know your tools. And that community excitement doesn't
substitute for good documentation.

Second moral: There are many ways to build apps in Ruby; Rails is but
one. A few hours of research can save many more hours later on.

--
James Britt

"A language that doesn't affect the way you think about programming is
not worth knowing."
- A. Perlis

Hal Fulton · Jul 27, 2007

David said:
Hi y'all,

This may end up being regarded as an incendiary posting, but it's not
meant to be. This is just an observation from a relative Ruby (in
general) and Rails (in particular) newb.

[snip]

Facts shouldn't be considered incendiary. I thought your account
sounded fair.

I'm not surprised that AR is slower than direct db access. I am
surprised if it is THAT much slower. But of course, you can do
direct db access in any language, including Ruby.

Direct access to the database probably isn't terribly railsy,
of course... but a factor of 700 slowdown is obviously
unacceptable.

And the fact that it is such a large factor makes me think that
*something* must be wrong. I am sure this is not the typical
person's experience...

Hal

David Rush · Jul 27, 2007

Direct access to the database probably isn't terribly railsy,
of course... but a factor of 700 slowdown is obviously
unacceptable.

And the fact that it is such a large factor makes me think that
*something* must be wrong. I am sure this is not the typical
person's experience...

Yes. The more I think about it, the more astonished I am. Especially
when my benchmarks which did straight row reads from the DB ended up
running within 20% of each other. I think the main source is a lot of
the 'convenience' features of the rails environment. As an example, I
know that I put in a lot of effort to stop Rails from introspecting
the columns of my habtm's - this is a *huge* time-waster, even if it
is handy when you're rapidly prototyping. Perhaps I should have moved
over to a 'production' environment?

I'm sure that the single biggest thing was being able to directly
implement a cache policy that suited the application, rather than
manipulating it at the long end of a long pole. That fact is probably
also indicative of a number of other small multiplicative factor
errors (many of which are my fault I'm sure) which when compounded
cause the explosion in processing time. Nearly 3 *decimal* orders of
magnitude is just a *huge* margin - if I didn't *know* that I used the
*same* algorithms, I'd be assuming that they must have been changed.

And that's why I just threw it out as a data point. I'm a fairly
experienced professional. What I found frustrating was the difficulty
I had in discovering my performance issues - which I am sure I can't
all lay at the feet of RoR. In my private musings after I posted last
night, I recalled the 'agile development' value expressed in the
introduction of _Agile Web Development with Rails_ that prefers
working code to extensive documentation. Well I know of a few managers
in my day that could have learned to moderate their stance a bit based
on that advice, but the flip side is that documentation is crucial to
re-usability. But that's a rabbit-hole I don't particularly want to
explore today.

david rush

David Rush · Jul 27, 2007

<snip>

Although it may be too late, might I suggest that
ActiveWarehouse ETL

Thank you. I will definitely take a look at it. There's a lot more ETL
to do.

david rush

Rimantas Liubertas · Jul 27, 2007

Perhaps I should have moved
over to a 'production' environment?

Duh.

Regards,
Rimantas

David Rush · Jul 27, 2007

Duh.

Perhaps I should have also included the implicit sarcasm implied by
the necessity of distinguishing a dedicated 'production' environment.
Good QA practice will nearly always tell you that test and production
should be as alike as possible - this also extends to dev unless you
want to spend a lot of time chasing issues on massively unfamiliar
systems.

Would production have given me a speed-up of 3 decimal orders of
magnitude?

david rush

Robert Klemme · Jul 27, 2007

2007/7/27 said:
So I'm beavering away at my lovely little start-up desk and really
rather enjoying Ruby (in between the moments of utter frustration)
and I start coding up some ETL processes to load and merge masses of
data into my bouncing baby web-system. And all is relatively good
until I get to my first tricky merge process where I have to
disambiguate names and otherwise harmonize my various data sources.

The process takes over 12 hours to run using ActiveRecord to provide
my DB access. For 5500 records.

I tweak DB indices. I get out CachedModel. I read a lot of code and
eat heaps of metaprogrammed object spaghetti. I run benchmarks. And I
finally conclude that my access patterns are totally defeating the
metaprogramming and requiring excessive DB traffic- even though I
can't really prove it.

So eventually I rewrote the program using a different language (which
I don't mention to avoid starting a flame-war - I could have used Ruby
on top of the MySQL interfaces) with a cache strategy that is better
suited to the DB access pattern.

The new process takes 55 *seconds*.

Makes me wonder: why did you choose a different language? As you said
yourself, you could have implemented the same strategy in Ruby as
well. Also, it seems for 5500 records you do not need a caching
strategy - you could just slurp in all the stuff into mem, do your
transformations and write it back. It seems with this approach even
AR would have provided sufficient performance, wouldn't it?
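
A minimal sketch of the "slurp it all into memory" approach described above, in plain Ruby. The rows, the column names and the normalize rule are all made up for illustration; in a real run the existing rows would come from one bulk query rather than a literal array.

```ruby
# Hypothetical rows standing in for the result of ONE bulk query.
existing = [
  { id: 1, name: "Acme Corp." },
  { id: 2, name: "Widget Works" }
]

# Hypothetical vendor records that need to be merged in.
incoming = [
  { name: "ACME corp" },
  { name: "Gadget Co" }
]

# Illustrative normalization only: lowercase, drop punctuation,
# collapse repeated spaces. Real name disambiguation is harder.
def normalize(name)
  name.downcase.gsub(/[^a-z0-9 ]/, "").squeeze(" ").strip
end

# Build the index once; every later lookup is a Hash hit, not a query.
by_name = existing.each_with_object({}) do |row, index|
  index[normalize(row[:name])] = row
end

# Split the incoming records into known and new names in memory.
matched, unmatched = incoming.partition do |row|
  by_name.key?(normalize(row[:name]))
end

puts "matched:   #{matched.map { |r| r[:name] }.inspect}"
puts "unmatched: #{unmatched.map { |r| r[:name] }.inspect}"
```

The point is only that the database is touched once for reading and once for writing back, however many comparisons happen in between.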

Kind regards

robert

Peña, Botp · Jul 27, 2007

From: David Rush [mailto:[email protected]]
# The process takes over 12 hours to run using ActiveRecord to provide
# my DB access. For 5500 records.

at 7 sec per record, that is interestingly terrible

can you post sample code?
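
For reference, the arithmetic behind the figures quoted in this thread, worked in Ruby. It takes "over 12 hours" as exactly 12 hours, so the real numbers are at least this bad; the variable names are just for illustration.

```ruby
ar_seconds      = 12 * 60 * 60  # "over 12 hours", taken as exactly 12
records         = 5500
rewrite_seconds = 55

per_record = ar_seconds.to_f / records          # about 7.85 s per record
slowdown   = ar_seconds.to_f / rewrite_seconds  # about 785x
orders     = Math.log10(slowdown)               # about 2.9 decimal orders

puts format("%.2f s/record, %.0fx slower, %.1f orders of magnitude",
            per_record, slowdown, orders)
```

So "7 sec per record", "a factor of 700" and "nearly 3 decimal orders of magnitude" are all roundings of the same two measurements.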

kind regards -botp

David Rush · Jul 28, 2007

Makes me wonder: why did you choose a different language?

Because I found the Ruby AR code to be so heavily metaprogrammed, so
inbred, and so weakly documented (internals, not API) that after three
days of crawling around inside the code I figured that it would be
faster to hack together a weak system that was based on the queries,
rather than the relatively nice model I had to work with for the UI.
And because in other languages I know the assumptions made by the
interpreters, virtual machines and compilers in much greater detail
and can therefore accurately code for CPU-efficiency.

As you said
yourself, you could have implemented the same strategy in Ruby as
well. Also, it seems for 5500 records you do not need a caching
strategy - you could just slurp in all the stuff into mem, do your
transformations and write it back. It seems with this approach even
AR would have provided sufficient performance, wouldn't it?

One would have thought. I did slurp all 5500 (BTW, this is an
intermediate load of a total data set of 25,000 - I have another
dataset of 20,000 waiting in the wings, and a deal I'm doing for
another million or so). Those 5500 required another 26000+ records
of various other model types in order to get them properly embedded in
the model - which is still not a big deal. However, slurping the
*entire* DB of over 250,000 records seemed like a potentially bad
idea, given that I don't have a good feel for the memory overheads in
Ruby yet.

I had a lot of difficulty figuring out where to hook the AR find
mechanisms to avoid multiple searches for the same records. Looking
back, I think I could do it now, but that is also because I have
dropped my expectation of what AR (or various net-available plugins)
will do and would just write my own access layer on top of AR.
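
A rough sketch of what such an access layer could look like: a read-through cache that remembers every record it has already fetched. LookupCache and FAKE_DB are hypothetical names invented for this example, and the block passed to the cache stands in for whatever finder actually hits the database; none of this is ActiveRecord API.

```ruby
# Hypothetical read-through cache. The block given to new does the
# real lookup; fetch only calls it the first time a key is seen.
class LookupCache
  attr_reader :misses

  def initialize(&loader)
    @loader = loader
    @store  = {}
    @misses = 0
  end

  def fetch(key)
    # key? (rather than a nil check) means "not found" is cached too.
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = @loader.call(key)
  end
end

# Stand-in for the database: a frozen Hash instead of a real query.
FAKE_DB = { 1 => "Books", 2 => "Music" }.freeze

cache   = LookupCache.new { |id| FAKE_DB[id] }
results = [1, 2, 1, 1, 2].map { |id| cache.fetch(id) }

puts "results: #{results.inspect}"
puts "lookups that reached the 'database': #{cache.misses}"
```

Five requests turn into two real lookups here. Whether that is the right policy depends on the access pattern, which is exactly the part a generic layer cannot know in advance.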

Don't you just love the perpetual recurrence of the glueware-on-
glueware antipattern?

david rush

David Rush · Jul 28, 2007

From: David Rush [mailto:[email protected]]
# The process takes over 12 hours to run using ActiveRecord to provide
# my DB access. For 5500 records.

at 7 sec per record, that is interestingly terrible
can you post sample code?

Reducing the context to fit in a news article would be difficult. I
suspect that the problematic data structure is a denormalized category
tree - at 6 levels deep with an average breadth of 7, I suspect there
was a lot of reloading going on while AR loaded more data than was
needed in this particular application.
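
To give a feel for the size of such a tree, here is a small Ruby sketch that builds a uniform tree with those dimensions and walks it from an in-memory index. It assumes the root counts as level 1 and that every node has exactly 7 children, which is a simplification of "average breadth"; the row layout (id, parent_id) is also an assumption, not the actual schema.

```ruby
BREADTH = 7
LEVELS  = 6  # assumption: the root counts as level 1

# Build flat rows, as a single SELECT over a category table might return.
rows     = [{ id: 1, parent_id: nil }]
frontier = [1]
next_id  = 2

(LEVELS - 1).times do
  frontier = frontier.flat_map do |parent|
    Array.new(BREADTH) do
      row = { id: next_id, parent_id: parent }
      rows << row
      next_id += 1
      row[:id]
    end
  end
end

# One pass groups the children under each parent id.
children = rows.group_by { |row| row[:parent_id] }

# Depth-first walk that never leaves memory.
def walk(id, children, visited = [])
  visited << id
  (children[id] || []).each { |child| walk(child[:id], children, visited) }
  visited
end

visited = walk(1, children)
puts "nodes in tree: #{rows.size}, nodes visited: #{visited.size}"
```

Under these assumptions a full walk touches 19,608 nodes, so any scheme that costs one query per node, repeated for each record being merged, adds up quickly. That is only an illustration of the suspicion above, not a measurement of what AR actually did.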

Mind you, all the loads performed by AR would have been totally useful
for the UI - so this is not so much a criticism as an engineering
post-mortem. There are very few tools that do many things well.

david rush
