Ruby-specific performance heuristics?

Hugh Sasse

I've been doing some stuff with CSV recently, having data in one
flat file that I'm trying to normalize into tables. Thus, I'm
trying to ignore things I've seen before so I don't create 2 entries
for them, and that sort of thing.

Jon Bentley points out that data structures shape programs -- i.e. how the
data is "shaped" determines how the program will be designed and how it
will perform. Often the best solution is not immediately obvious, as he
discusses. So, for example, one can test for "seen before" by

array = []

unless array.include? item
  array << item
end

which is much slower for clumped data than

unless array.include? item
  array.unshift item
end

or one could use a hash, or maybe a set.
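
For illustration, here's a rough sketch of the three approaches side by
side (made-up test data; just the standard benchmark and set libraries):

require 'benchmark'
require 'set'

items = (1..5_000).map { rand(1_000) }   # made-up test data

Benchmark.bm(6) do |bm|
  bm.report('array') do
    seen = []
    items.each { |i| seen << i unless seen.include?(i) }  # linear scan per item
  end
  bm.report('hash') do
    seen = {}
    items.each { |i| seen[i] = true }   # constant-time lookup; dups overwrite
  end
  bm.report('set') do
    seen = Set.new
    items.each { |i| seen.add(i) }      # Set ignores duplicates
  end
end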

So, my question is [a member of the set of "How long is a piece of
string?" type questions]:

Are there any heuristics for performance of *Ruby* data structures
available, which guide one to designing code to *often* have good
performance?

The answer will be very much "it depends, you need to test it, it's
a function of data size, ...", but I suspect the implementation,
i.e. Ruby, puts useful bounds on the problem.

My searching has not turned up anything, though there are good
performance hints in the PickAxe. I'm wondering if the info
exists or if it would be worth trying to create it, or
if it is just too broad a problem to be worth it. However, since we
Rubyists often speak of the ability to put code together quickly,
having hints around on how to do so while still yielding efficient
code seems of some worth.

Hugh
 
Robert Klemme

Hugh said:
I've been doing some stuff with CSV recently, having data in one
flat file that I'm trying to normalize into tables. Thus, I'm
trying to ignore things I've seen before so I don't create 2 entries
for them, and that sort of thing.

Jon Bentley points out that data structures shape programs -- i.e. how the
data is "shaped" determines how the program will be designed and how it
will perform. Often the best solution is not immediately obvious, as he
discusses. So, for example, one can test for "seen before" by

array = []

unless array.include? item
  array << item
end

which is much slower for clumped data than

unless array.include? item
  array.unshift item
end

or one could use a hash, or maybe a set.

Yes, definitely.
So, my question is [a member of the set of "How long is a piece of
string?" type questions]:

Are there any heuristics for performance of *Ruby* data structures
available, which guide one to designing code to *often* have good
performance?

I think the major differences performance-wise are in the algorithms and
data structures as such and not the language: a hash is faster for lookups
in Ruby than an array - as in any other language that has it.

The only thing that comes to mind spontaneously is that it's a good idea
to freeze a string that you are going to use as a hash key: an unfrozen
String key is duped on insertion (to prevent errors caused by aliasing),
and freezing avoids that copy.
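
You can see the copy happen (a minimal sketch; behaviour as in MRI):

key = "colour"
h = {}
h[key] = 1
p h.keys.first.equal?(key)      # false - the unfrozen key was duped

frozen = "colour".freeze
h2 = {}
h2[frozen] = 1
p h2.keys.first.equal?(frozen)  # true - the frozen key is stored as-is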
The answer will be very much "it depends, you need to test it, it's
a function of data size, ...", but I suspect the implementation,
i.e. Ruby, puts useful bounds on the problem.

Yes, definitely.
My searching has not turned up anything, though there are good
performance hints in the PickAxe. I'm wondering if the info
exists or if it would be worth trying to create it, or
if it is just too broad a problem to be worth it. However, since we
Rubyists often speak of the ability to put code together quickly,
having hints around on how to do so while still yielding efficient
code seems of some worth.

Another general rule of thumb that is true for Java and also Ruby is that
object creation is comparatively expensive. For example, "foo" + "bar" is
more expensive than "foo" << "bar". There are only a few exceptions to
this rule (Fixnums, for example).
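
A quick sketch of the difference (numbers will vary with version and
machine):

require 'benchmark'

n = 100_000
Benchmark.bm(4) do |bm|
  # String#+ allocates a brand-new result String on every iteration
  bm.report('+')  { n.times { "foo" + "bar" } }
  # String#<< appends in place - no extra intermediate object
  bm.report('<<') { n.times { "foo" << "bar" } }
end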

Kind regards

robert
 
Ara.T.Howard

Hugh Sasse wrote [full question quoted above; trimmed here to the core]:

Are there any heuristics for performance of *Ruby* data structures
available, which guide one to designing code to *often* have good
performance?


here's some things i think of:

- use class variables of some sort : never store in an object what should be
stored in a class

- write lazy attribute getters : if your parser loads 15,000 fields from a
header but normally only 4 or 5 are used in code, write the parser to parse
each field on demand and cache the result

- don't write explicit IO : use mmap and let the kernel worry about it

- use processes not threads : ipc is so easy with ruby (using local drb
objects) that it's ridiculous not to use multiple processes so
multi-processor machines can actually take best advantage of your code.
sometimes threads are better for a problem - but rarely - especially when
you count debugging time ;-)

- don't check things : assume an arg is an int - the stack trace is plenty
good to tell you it wasn't and the check slows things down

- use rbtree : it's amazing ;-)

- use sqlite : it's amazing ;-)

- ruby inline : it's amazing ;-)

- use iterator methods for filesystem operations : prefer

require 'find'
Find.find('.'){|f| ... }

over

Dir['**/*'].each{|f| ... }

the second kills you when a dir has 100,000 files in it - the glob builds
the whole array up front, while find yields one path at a time.

- compose things out of arrays, hashes, and strings whenever possible. these
objects are pretty optimized in ruby and serialize great as yaml, meaning
you never need to write hand parsers.

- eliminate variable creation where possible : prefer

n = nil
obj.each{|elem| n = elem ** 2 and p n}

over

obj.each{|elem| n = elem ** 2 and p n}

(pre-declaring n outside the block lets every call reuse one local instead
of creating a fresh one each time)

- anchor every single regular expression : performance degrades
exponentially otherwise

- compile regular expressions only once unless they require variable
substitution

- cache anything possible, for instance methods dynamically generated from
method_missing : permanently add the method to the object instead of
generating it every time (see the sketch after this list)
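
e.g. the method_missing caching idea looks something like this (a sketch -
the Record class and its fields are made up):

class Record
  def initialize(fields)
    @fields = fields
  end

  def method_missing(name, *args)
    key = name.to_s
    return super unless @fields.key?(key)
    # permanently define the accessor so only the first call pays the
    # method_missing dispatch penalty
    self.class.class_eval do
      define_method(name) { @fields[key] }
    end
    send(name)
  end
end

r = Record.new 'id' => 42, 'name' => 'foo'
r.name   # slow - goes through method_missing, which defines Record#name
r.name   # fast - now an ordinary method call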


now, i'm not saying these are 'right' - just that they are what i think of
when trying to make ruby fast.

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================
 
Robert Klemme

Ara.T.Howard said:
here's some things i think of:

- anchor every single regular expression : performance degrades
exponentially otherwise

That's not possible - especially with String#scan. Sometimes you want all
occurrences of something.
- compile regular expressions only once unless they require
variable substitution

I guess you're talking about the /o flag, or storing a regexp object
somewhere, which is really only needed if the regexp contains interpolation
that you want done only once. Otherwise a regexp *is* compiled only once,
and inline regexps are roughly as performant as a stored regexp and might
be easier to read (not always).
- cache anything possible, for instance methods dynamically
generated from method_missing : permanently add the method to
the object instead of generating it every time


now, i'm not saying these are 'right' - just that they are what i
think of when trying to make ruby fast.

In fact I agree with most of what you wrote. :)

Kind regards

robert
 
Ara.T.Howard

That's not possible - especially with String#scan. Sometimes you want all
occurrences of something.

true. however there is nearly always __some__ anchor one can use:

harp:~ > irb
irb(main):001:0> "foo bar".scan %r/ \b\w+ /x
=> ["foo", "bar"]

a better thing would be to say "use an anchor unless you know exactly why you
shouldn't."
I guess you're talking about flag /o or storage of a regexp object somewhere
which is really only needed if the regexp contains interpolation that you
want to do only once. Otherwise a regexp *is* compiled only once and inline
regexps are roughly as performant as a stored regexp and might be easier to
read (not always).

you are quite right - guy pointed this out to me once before - for some reason
my brain hasn't retained it though ;-)

cheers.

-a
 
Robert Klemme

Ara.T.Howard said:
true. however there is nearly always __some__ anchor one can use:

Ah, yes, I overlooked that one.
harp:~ > irb
irb(main):001:0> "foo bar".scan %r/ \b\w+ /x
=> ["foo", "bar"]

But does this really make such a big difference? Or does this depend on
input sequence and rx? My quick hack test didn't show significant
differences.
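
Something of this shape (a sketch reconstructing the test - not the literal
bm-rx-anchor.rb):

require 'benchmark'

str = "foo bar baz " * 10
n = 50_000

Benchmark.bmbm do |bm|
  bm.report('no anchor') { n.times { str.scan(/\w+/) } }
  bm.report('1 anchor')  { n.times { str.scan(/\b\w+/) } }
  bm.report('2 anchors') { n.times { str.scan(/\b\w+\b/) } }
end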

17:29:48 [ruby]: ruby bm-rx-anchor.rb
Rehearsal -------------------------------------------------------
no anchor 4.469000 0.000000 4.469000 ( 4.465000)
1 anchor 4.515000 0.000000 4.515000 ( 4.525000)
2 anchors 4.500000 0.000000 4.500000 ( 4.520000)
--------------------------------------------- total: 13.484000sec

user system total real
no anchor 4.500000 0.000000 4.500000 ( 4.509000)
1 anchor 4.484000 0.000000 4.484000 ( 4.494000)
2 anchors 4.531000 0.000000 4.531000 ( 4.531000)

a better thing would be to say "use an anchor unless you know exactly
why you shouldn't."

Sounds good.
you are quite right - guy pointed this out to me once before - for
some reason my brain hasn't retained it though ;-)

Well, repetition helps. :))

Kind regards

robert
 
Nikolai Weibull

Ara.T.Howard said:
- anchor every single regular expression : performance degrades
exponentially otherwise

That's an overstatement. True, use anchors whenever possible, as they
are great hints to the engine and they certainly limit the number of
possible matches very often, but it's not like every regular expression
without anchors takes O(N^M),
nikolai
 
Ara.T.Howard

That's an overstatement. True, use anchors whenever possible, as they
are great hints to the engine and they certainly limit the number of
possible matches very often, but it's not like every regular expression
without anchors takes O(N^M),
nikolai

i don't think it is an overstatement. i'm not saying every regex w/o an
anchor __is__ O(N^M) - just that they degrade to that worst case much more
easily w/o them. a single misplaced '.*' can do this. one time i was
examining a processing machine that uses regexes to match keys of items that
arrive over the network and are inserted into a giant in-memory circular queue
(http://my.unidata.ucar.edu/content/software/ldm/archive/index.html) because
its cpu usage had shot up from around 5 to 50% for no apparent reason. the
keys were not large (10-100 bytes) and the regexes ldm is configured with tend
to be pretty simple, things like

/F10.*OLS/
/F12.*OLS/

and these are matched against strings like

F10200008100242.OLS

but there are __many__ of these strings. in any case it took me about two
days to convince our data manager to let me re-write the regexes - they just
couldn't believe that this could kill a 3.3ghz machine, especially when the
code was written in c and the patterns and strings were so small/simple.
needless to say a few small tweaks to the expressions, anchors and minimal
modifiers mostly, and the machine was back to normal. it was hard for them to
accept even then - i had to draw an exponential graph and say "see, this is
where we were (close to zero on the x-axis) and this is where we are
(teeny-tiny bit to the right, but __way__ up on the y-axis)", explain big-O
notation to them, and explain that regexes are, in fact, mini computer
programs describing state machines with looping constructs - and i still
don't think they got it, because what they had before worked and the incoming
volume hadn't changed "that much." so, for someone like yourself that advice
is a bit strong, but for a lot of people it's better to assert it, since it
generally ends up making them gain a better understanding of the input and
write a better expression anyhow, like

%r/ ^ \s* foo /x vs %r/ foo /x

and it (using anchors/minimal modifiers) is seldom the wrong thing to do.

in summary that advice is probably misplaced on this list - but i'll still
hand it out in general.

cheers.

-a
 
Eric Mahurin

--- "Ara.T.Howard said:
On Sat, 3 Sep 2005, Robert Klemme wrote:
=20
That's not possible - especially with String#scan.=20 Sometimes you want all
occurrences of something.
=20
true. however there nearly always __some__ anchor one can
use:
=20
harp:~ > irb
irb(main):001:0> "foo bar".scan %r/ \b\w+ /x
=3D> ["foo", "bar"]
=20
a better thing would be to say "use an anchor unless you know
exactly why you
shouldn't."

Regarding performance, just using any anchor won't necessarily
help. The important ones are \A (beginning of
string) and \G (where the last match stopped). You should be
able to make just about any regexp not using these use one of
these and match the same thing. For example, this:

/<non-anchored regexp>/m

is equivalent to:

/\A.*?<non-anchored regexp>/m

I used "m" (multi-line) so that "." matched anything. One
should realize that using a non-anchored (\A or \G) regexp has
the above equivalence (.*?) when matching and the associated
performance penalty. Usually, you can simplify the .*? in your
application.

Regarding your scan case above, the \G anchored way to do it
would be:

"foo bar".scan(/\G\s*\w+/)

Of course this returns leading spaces in the result. You could group
the \w+, but then you get another Array level which degrades
performance.

I think \A or \G anchoring is also better because you get
better syntax checking of your input - not ignoring what comes
before the thing you want.

It would be nice if there were an option in many of the regexp
matching methods to implicitly anchor your regexp. About the
only ones that do this are the methods of StringScanner.
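
For example (a small sketch - every StringScanner match is implicitly
anchored at the current scan position, i.e. \G):

require 'strscan'

s = StringScanner.new("foo bar baz")
words = []
until s.eos?
  s.skip(/\s+/)            # anchored at the scan position; skips separators
  words << s.scan(/\w+/)   # likewise anchored - never scans ahead
end
p words   #=> ["foo", "bar", "baz"]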



 
Hugh Sasse

Wow, lots of good stuff already. I'm collecting these on the web
(hope that's OK, attributed, slightly edited for brevity, no email
addresses)...
The page, which needs the formatting tweaking some more, is

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

as yet not linked into my ruby page, but Google will probably find
it soon anyway...

Things I'd like to comment on....

- don't write explicit IO : use mmap and let the kernel worry about it

I'm not familiar enough with mmap to understand this one :)
- compose things out of arrays, hashes, and strings whenever possible. these
objects are pretty optimized in ruby and serialize great as yaml, meaning
you never need to write hand parsers.

Do you mean, as opposed to creating new classes and composing with
those?
- cache anything possible, for instance methods dynamically generated from
method_missing : permanently add the method to the object instead of
generating it every time

I've tried to do this and found performance degraded. I'm not
sure why, but I think this post on Redhanded

http://redhanded.hobix.com/inspect/theFullyUpturnedBin.html

(unfortunately mauled by poker spammers, <seethe/>), referencing

http://whytheluckystiff.net/articles/theFullyUpturnedBin.html

has something to do with it, because of "Grab things in 8MB
chunks" in particular. One case that stood out was caching the results
of floating point calculations involving roots: caching was slower.
now, i'm not saying these are 'right' - just that they are what i think of
when trying to make ruby fast.

cheers.

-a

Thank you, All,
Hugh
 
Nikolai Weibull

i don't think it is an overstatement. i'm not saying every regex w/o
an anchor __is__ O(N^M) - just that they degrade to that worst
case much more easily w/o them. a single misplaced '.*' can do this. one
time i was examining a processing machine that uses regexes to match
keys of items that arrive over the network and are inserted into a
giant in-memory circular queue
(http://my.unidata.ucar.edu/content/software/ldm/archive/index.html)
because its cpu usage had shot up from around 5 to 50% for no
apparent reason. the keys were not large (10-100 bytes) and the
regexes ldm is configured with tend to be pretty simple, things like

/F10.*OLS/
/F12.*OLS/

and these are matched against strings like

F10200008100242.OLS

Well, what you said was that "performance degrades exponentially
otherwise".

I don't have your whole dataset, but it looks to me like you could use
a wild-card matcher instead. Another good solution would be to
not use Ruby's built-in regular expression matcher, as it's implemented
using backtracking, which, you are correct, has a tendency to go haywire
far too often. That's the price you pay for having crap like
backreferencing and look-around, I'm afraid.
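
You can watch a backtracker go haywire with the classic nested-quantifier
case (a tiny sketch - be careful raising the exponent):

require 'benchmark'

str = "a" * 24 + "!"   # cannot possibly match
# nested quantifiers force the engine to try exponentially many splits
puts Benchmark.realtime { /\A(a+)+\z/ =~ str }
# this matches exactly the same strings, and fails instantly
puts Benchmark.realtime { /\Aa+\z/ =~ str }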

(Sidenote: I'm releasing a regular expression matcher for Ruby in the
near future that is quite nice.)

Anyway, your advice is sound; the more information you give a
backtracking NFA implementation, the better,
nikolai
 
Ara.T.Howard

(Sidenote: I'm releasing a regular expression matcher for Ruby in a near
future that is quite nice.)

Anyway, your advice is sound; the more information you give a
backtracking NFA implementation, the better,

can i assume then, that yours is DFA?

cheers.

-a
 
Hugh Sasse

I seem to be finding that using Struct::Thingy.new is slower than
Thingy.new where Thingy is a class with attr_accessor :this,
:that,...
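
A minimal sketch of the kind of comparison I mean (Thingy cut down to two
attributes):

require 'benchmark'

SThingy = Struct.new(:this, :that)

class CThingy
  attr_accessor :this, :that
  def initialize(this, that)
    @this, @that = this, that
  end
end

n = 100_000
Benchmark.bm(7) do |bm|
  bm.report('Struct') { n.times { SThingy.new(1, 2) } }
  bm.report('class')  { n.times { CThingy.new(1, 2) } }
end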

Hugh
 
Ara.T.Howard

Wow, lots of good stuff already. I'm collecting these on the web
(hope that's OK, attributed, slightly edited for brevity, no email
addresses)... The page, which needs the formatting tweaking some more, is

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/
cool.


I'm not familiar enough with mmap to understand this one :)

guy's abstraction is powerful, you just use it like a string:

harp:~ > cat a.rb
require 'mmap'
require 'tempfile'
#
# generate a test file
#
t = Tempfile::new Process::pid.to_s
t.puts 'foobar'
t.close

#
# modify it in place - letting the kernel do the IO
#
m = Mmap::new t.path, 'rw', Mmap::MAP_SHARED
m.gsub! %r/foobar/, 'barfoo'
m.msync
m.munmap

#
# show it
#
p IO::read(t.path)

harp:~ > ruby a.rb
"barfoo\n"

it's extremely powerful when you are changing, say, a scanline of pixels in the
middle of a 40MB image but nothing else - the kernel will only read/write the
blocks you need.
Do you mean, as opposed to creating new classes and composing with those?

sometimes ;-) i've just been surprised by the performance of ruby's built-ins
lately. for instance, rbtree used to be tons faster for big maps, but hashes
don't do too badly now. built-in strings share storage sometimes, etc.
I've tried to do this and found performance has degraded. I'm not
sure why, but think this post on Redhanded

http://redhanded.hobix.com/inspect/theFullyUpturnedBin.html

(unfortunately mauled by poker spammers, <seethe/>), referencing

http://whytheluckystiff.net/articles/theFullyUpturnedBin.html

has something to do with it, and because of "Grab things in 8MB
chunks" in particular. One case that stood out was caching results
of floating point calculations involving roots: caching was slower.

sure. it's just a guideline - i've run up against similar things too. can't
avoid testing in the end ;-)

-a
 
Hugh Sasse


Thank you. I've still not tidied it up.
I'm not familiar enough with mmap to understand this one :)

guy's abstraction is powerful, you just use it like a string:

harp:~ > cat a.rb
require 'mmap'
require 'tempfile'
[...]
#
# modify it in place - letting the kernel do the IO
#
m = Mmap::new t.path, 'rw', Mmap::MAP_SHARED
m.gsub! %r/foobar/, 'barfoo'
m.msync
m.munmap
Nice.
[...]

harp:~ > ruby a.rb
"barfoo\n"

it's extremely powerful when you are changing, say, a scanline of pixels in
the middle of a 40MB image but nothing else - the kernel will only read/write
the blocks you need.

That's a great help. I'll have to look into that, not that I do
much image processing these days.
sometimes ;-) i've just been surprised by the performance of ruby's built-ins
lately. for instance, rbtree used to be tons faster for big maps, but hashes
don't do too badly now. built-in strings share storage sometimes, etc.

Yes, I'd like to write some test cases for this exploration at some
point, so as future ruby versions appear we can do regression tests
on what is claimed to be quickest.
sure. it's just a guideline - i've run up against similar things too. can't
avoid testing in the end ;-)

Agreed. I'm just wondering if the heuristic should be to cache
things of the order of megabytes, because having about 8 of them
will "trip the switch"? I've not read the code and don't know
enough about GC to say much that's meaningful.

Thank you,
Hugh
 
