RCR 296: Destructive methods return self


Daniel Amelang

I know that it's not standard policy to announce RCRs on ruby-talk,
but since this has been a topic of conversation (debate, polemic,
death threats?) lately on this list, I thought it relevant.

RCR 296: Destructive methods return self

http://www.rcrchive.net/rcr/show/296

Let the flame war begin!

Dan
 

David A. Black

Hi --

I know that it's not standard policy to announce RCRs on ruby-talk,
but since this has been a topic of conversation (debate, polemic,
death threats?) lately on this list, I thought it relevant.

RCR 296: Destructive methods return self

http://www.rcrchive.net/rcr/show/296

Let the flame war begin!

No war on RCRchive, please.... The 150 or so discussions of this
topic I've been witness to have all been very civil :)

By the way, the ruby-talk.org/###### URL syntax doesn't work any more.
You have to do:
http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/######

I'll try to fix them directly in the database.


David
 

David A. Black

I'll try to fix them directly in the database.

Actually you can do this as a minor edit, as the RCR's author. (Silly
me, I forgot how convenient I made it :)


David
 

Daniel Amelang

Update: the RCR is now 297. Sorry, slip of the hand. I really meant
'minor edit', I swear!
 

Daniel Amelang

For those interested in an alternative, I just put this up on the RCR:

Perhaps a more interesting solution would be to get rid of all
destructive methods and work on optimizing the existing
non-destructive counterparts. Matz seems to prefer this solution.

http://ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/17764

Unfortunately, the non-destructive versions are still quite noticeably slower.

If interest in this approach materializes, I would be happy to
supersede this RCR.

Dan
 

Nikolai Weibull

* Daniel Amelang (Mar 19, 2005 23:30):
Perhaps a more interesting solution would be to get rid of all
destructive methods and work on optimizing the existing
non-destructive counterparts. Matz seems to prefer this solution.

Unfortunately, the non-destructive versions are still quite noticeably
slower.

No they're not. Where are you getting these assertions from?,
nikolai
 

Daniel Amelang

No they're not. Where are you getting these assertions from?,

I use Ruby in the bioinformatics field. We deal with very long
biological sequence data and protein structures. I'll have to go
through my code and post some examples with corresponding bio data,
but I've already spent enough time today on the RCR. I may also have
some examples from my computer graphics class years ago that used
sort!. You're right to ask for benchmarks; I just have to move on
to other things right now.

So, anyone else want to brew up some examples? I'd be glad to be proved wrong.

And don't forget, unless bang (!) methods are eliminated entirely,
there is still the problem of the confusion they cause (even among
experienced Ruby programmers).
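
For example (nothing RCR-specific here, just the current behaviour of
bang methods returning nil when no change was made):

line = "dan\n"
line.chomp!           # => "dan"   the newline was removed, so the string comes back

name = "dan".chomp!   # => nil     nothing to chomp, so nil comes back instead
name.upcase           # NoMethodError, because name is nil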

Dan
 

David A. Black

Hi --

Update: the RCR is now 297. Sorry, slip of the hand. I really meant
'minor edit', I swear!

I manually fixed it -- it's 296 again :) (with the comments and
votes from 297).


David
 

Daniel Berger

Nikolai said:
* Daniel Amelang (Mar 19, 2005 23:30):

No they're not. Where are you getting these assertions from?,
nikolai

Oh, I dunno. Maybe this benchmark?

The only exception was slice vs. slice!. No idea what happened there.

Here are the results of my benchmark running on a 1.8 GHz P4, 512MB
RAM, on Windows XP, SP2. The actual benchmark code I used is further
below.

C:\eclipse\workspace\ruby-foo\benchmarks>ruby bang.rb
user system total real
capitalize 1.862000 0.000000 1.862000 ( 1.923000)
capitalize! 0.591000 0.000000 0.591000 ( 0.600000)
chomp 1.663000 0.010000 1.673000 ( 1.853000)
chomp! 0.510000 0.000000 0.510000 ( 0.511000)
delete 5.739000 0.030000 5.769000 ( 6.449000)
delete! 4.396000 0.000000 4.396000 ( 4.606000)
downcase 1.712000 0.010000 1.722000 ( 2.294000)
downcase! 0.581000 0.000000 0.581000 ( 0.580000)
gsub 4.296000 0.030000 4.326000 ( 5.188000)
gsub! 0.972000 0.000000 0.972000 ( 1.161000)
lstrip 1.762000 0.020000 1.782000 ( 2.123000)
lstrip! 0.431000 0.000000 0.431000 ( 0.431000)
next 1.813000 0.000000 1.813000 ( 1.933000)
next! 1.331000 0.010000 1.341000 ( 1.472000)
reverse 1.633000 0.010000 1.643000 ( 1.742000)
reverse! 0.380000 0.000000 0.380000 ( 0.501000)
rstrip 1.773000 0.000000 1.773000 ( 1.882000)
rstrip! 0.420000 0.000000 0.420000 ( 0.431000)
slice 1.883000 0.020000 1.903000 ( 2.454000)
slice! 3.415000 0.010000 3.425000 ( 3.765000)
strip 1.793000 0.000000 1.793000 ( 1.812000)
strip! 0.370000 0.000000 0.370000 ( 0.411000)
sub 1.302000 0.000000 1.302000 ( 1.382000)
sub! 0.901000 0.000000 0.901000 ( 1.202000)
swapcase 0.862000 0.000000 0.862000 ( 0.961000)
swapcase! 0.420000 0.000000 0.420000 ( 0.451000)
tr 3.515000 0.000000 3.515000 ( 3.925000)
tr! 2.284000 0.000000 2.284000 ( 2.484000)
tr_s 3.545000 0.010000 3.555000 ( 3.845000)
tr_s! 2.193000 0.000000 2.193000 ( 2.303000)
upcase 2.023000 0.000000 2.023000 ( 2.103000)
upcase! 0.631000 0.000000 0.631000 ( 0.651000)

require "benchmark"
include Benchmark

s1 = " hello "
s2 = " hello "
s3 = " hello "
max = 500000

bm do |bench|
bench.report("capitalize"){
max.times{ s1.capitalize }
}

bench.report("capitalize!"){
max.times{ s1.capitalize! }
}

bench.report("chomp"){
max.times{ s1.chomp }
}

bench.report("chomp!"){
max.times{ s1.chomp! }
}

bench.report("delete"){
max.times{ s1.delete("o") }
}

bench.report("delete!"){
max.times{ s1.delete!("o") }
}

bench.report("downcase"){
max.times{ s1.downcase }
}

bench.report("downcase!"){
max.times{ s1.downcase! }
}

bench.report("gsub"){
max.times{ s1.gsub(/e/,"i") }
}

bench.report("gsub!"){
max.times{ s1.gsub!(/e/,"i") }
}

bench.report("lstrip"){
max.times{ s1.lstrip }
}

bench.report("lstrip!"){
max.times{ s1.lstrip! }
}

bench.report("next"){
max.times{ s1.next }
}

bench.report("next!"){
max.times{ s1.next! }
}

bench.report("reverse"){
max.times{ s1.reverse }
}

bench.report("reverse!"){
max.times{ s1.reverse! }
}

bench.report("rstrip"){
max.times{ s1.rstrip }
}

bench.report("rstrip!"){
max.times{ s1.rstrip! }
}

bench.report("slice"){
max.times{ s2.slice(1..4) }
}

bench.report("slice!"){
max.times{ s2.slice!(1..4) }
}

bench.report("strip"){
max.times{ s2.strip }
}

bench.report("strip!"){
max.times{ s2.strip! }
}

bench.report("sub"){
max.times{ s2.sub(/e/,"i") }
}

bench.report("sub!"){
max.times{ s2.sub!(/e/,"i") }
}

bench.report("swapcase"){
max.times{ s2.swapcase }
}

bench.report("swapcase!"){
max.times{ s2.swapcase! }
}

bench.report("tr"){
max.times{ s3.tr("e","a") }
}

bench.report("tr"){
max.times{ s3.tr!("e","a") }
}

bench.report("tr_s"){
max.times{ s3.tr_s("a","e") }
}

bench.report("tr_s!"){
max.times{ s3.tr_s!("a","e") }
}

bench.report("upcase"){
max.times{ s3.upcase }
}

bench.report("upcase!"){
max.times{ s3.upcase! }
}
end

Regards,

Dan
 

Nikolai Weibull

* Daniel Berger (Mar 20, 2005 01:10):
Oh, I dunno. Maybe this benchmark?
The only exception was slice vs. slice!. No idea what happened there.

Well, with slice!, you have to do more work than with slice. The
reason is that you have to paste together the parts that are still
part of the string. With slice you simply return the substring.
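
For example (ordinary String behaviour, just to make the difference concrete):

s = "hello world"
s.slice(0, 5)    # => "hello"        s is untouched
s                # => "hello world"

t = "hello world"
t.slice!(0, 5)   # => "hello"        the remainder has to be moved up in place
t                # => " world"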

For me, next! is also slower than next.

A comment on your benchmark: doesn't it make more sense to do the tests
on new strings each time? The benchmark should surely try to evaluate
the merit of returning a new string versus replacing its contents,
right? With a benchmark that preallocates 'max' copies of 's1' and
fetches them from an array for each of the calls to the respective
functions, the times are much more evenly matched:

user system total real
capitalize 0.660000 0.000000 0.660000 ( 0.675759)
capitalize! 0.480000 0.010000 0.490000 ( 0.494108)
chomp 0.660000 0.000000 0.660000 ( 0.680509)
chomp! 0.370000 0.000000 0.370000 ( 0.370647)
delete 1.300000 0.000000 1.300000 ( 1.296738)
delete! 1.070000 0.000000 1.070000 ( 1.068162)
downcase 0.630000 0.000000 0.630000 ( 0.639248)
downcase! 0.400000 0.000000 0.400000 ( 0.399627)
gsub 2.710000 0.020000 2.730000 ( 2.844503)
gsub! 2.250000 0.030000 2.280000 ( 2.331406)
lstrip 0.700000 0.000000 0.700000 ( 0.705452)
lstrip! 0.430000 0.000000 0.430000 ( 0.424257)
next 0.570000 0.000000 0.570000 ( 0.570100)
next! 0.720000 0.000000 0.720000 ( 0.725646)
slice 0.850000 0.000000 0.850000 ( 0.844139)
slice! 1.210000 0.000000 1.210000 ( 1.217413)
reverse 0.510000 0.000000 0.510000 ( 0.514665)
reverse! 0.350000 0.000000 0.350000 ( 0.349400)
rstrip 0.660000 0.000000 0.660000 ( 0.664235)
rstrip! 0.370000 0.000000 0.370000 ( 0.371327)
strip 0.550000 0.000000 0.550000 ( 0.547162)
strip! 0.390000 0.000000 0.390000 ( 0.390382)
sub 0.940000 0.000000 0.940000 ( 0.938963)
sub! 0.690000 0.000000 0.690000 ( 0.690106)
swapcase 0.630000 0.000000 0.630000 ( 0.630045)
swapcase! 0.390000 0.000000 0.390000 ( 0.392977)
tr 0.850000 0.000000 0.850000 ( 0.852794)
tr! 0.750000 0.000000 0.750000 ( 0.751016)
tr_s 0.850000 0.000000 0.850000 ( 0.845268)
tr_s! 0.690000 0.000000 0.690000 ( 0.690721)
upcase 0.600000 0.000000 0.600000 ( 0.606707)
upcase! 0.390000 0.000000 0.390000 ( 0.413688)
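
The preallocating version looked roughly like this (a sketch of the
idea described above, not the exact script used; only one method pair
shown):

require "benchmark"
include Benchmark

max = 500_000
fresh = lambda { Array.new(max) { " hello ".dup } }  # one unmodified copy per call

bm do |bench|
  copies = fresh.call
  bench.report("capitalize")  { max.times { |i| copies[i].capitalize } }

  copies = fresh.call   # rebuild so the bang version also starts from fresh strings
  bench.report("capitalize!") { max.times { |i| copies[i].capitalize! } }
end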

Anyway, if you don't like this test, here are the times on my machine
(it seems Ruby on a Linux system runs a lot faster than on a Windows
system, and much more evenly between the two versions of each method):

user system total real
capitalize 0.490000 0.000000 0.490000 ( 0.488138)
capitalize! 0.290000 0.000000 0.290000 ( 0.286009)
chomp 0.460000 0.000000 0.460000 ( 0.466617)
chomp! 0.220000 0.000000 0.220000 ( 0.219798)
delete 1.150000 0.000000 1.150000 ( 1.153041)
delete! 0.920000 0.000000 0.920000 ( 0.912380)
downcase 0.490000 0.000000 0.490000 ( 0.489456)
downcase! 0.240000 0.000000 0.240000 ( 0.240895)
gsub 1.250000 0.000000 1.250000 ( 1.254223)
gsub! 0.520000 0.000000 0.520000 ( 0.516004)
lstrip 0.480000 0.000000 0.480000 ( 0.485036)
lstrip! 0.200000 0.000000 0.200000 ( 0.198516)
next 0.410000 0.000000 0.410000 ( 0.406751)
next! 0.430000 0.000000 0.430000 ( 0.430365)
reverse 0.380000 0.000000 0.380000 ( 0.384075)
reverse! 0.190000 0.000000 0.190000 ( 0.193044)
rstrip 0.470000 0.000000 0.470000 ( 0.471682)
rstrip! 0.200000 0.000000 0.200000 ( 0.194479)
slice 0.600000 0.000000 0.600000 ( 0.596786)
slice! 0.960000 0.000000 0.960000 ( 0.961636)
strip 0.470000 0.000000 0.470000 ( 0.469545)
strip! 0.180000 0.000000 0.180000 ( 0.185437)
sub 0.620000 0.000000 0.620000 ( 0.620524)
sub! 0.470000 0.000000 0.470000 ( 0.464609)
swapcase 0.380000 0.000000 0.380000 ( 0.380528)
swapcase! 0.190000 0.000000 0.190000 ( 0.192872)
tr 0.810000 0.000000 0.810000 ( 0.818261)
tr! 0.590000 0.000000 0.590000 ( 0.615229)
tr_s 0.850000 0.000000 0.850000 ( 0.858405)
tr_s! 0.590000 0.000000 0.590000 ( 0.597201)
upcase 0.550000 0.000000 0.550000 ( 0.578785)
upcase! 0.250000 0.000000 0.250000 ( 0.252732)

I'm not arguing with the fact that for most methods, the destructive
version will be faster. I'm just saying that the performance difference
between the two isn't that big,
nikolai
 

Nikolai Weibull

* Daniel Amelang (Mar 19, 2005 23:50):
I use ruby in the bioinformatics field. We deal with very long
biological sequence data and protein structures.

How long? Are you sure String is the class for you? Is it possible
that some domain-specific data structures would suit your tasks better?
I'm being curious, not provocative,
nikolai
 

Glenn Parker

Nikolai said:
Well, with slice!, you have to do more work than doing a slice. The
reason is that you have to paste together the parts that are still a
part of the string. With slice you simply return the substring.

Seems to me that the same logic should apply to gsub vs. gsub!, but it
doesn't quite work out that way. A gsub benchmark using different
lengths of input and output might be more appropriate.
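
Something along those lines might look like this (a sketch only; the
sizes, pattern, and replacement text are arbitrary, and the dup appears
in both loops so the copy cost cancels out):

require "benchmark"
include Benchmark

bm do |bench|
  [100, 10_000, 1_000_000].each do |len|
    base = "abcde" * (len / 5)          # input length varies per pass
    bench.report("gsub  #{len}")  { 50.times { base.dup.gsub(/c/, "xy") } }
    bench.report("gsub! #{len}")  { 50.times { base.dup.gsub!(/c/, "xy") } }
  end
end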

An aside: I do not accept the assertion that Strings should only be used
when they can be kept short. I still expect reasonably good performance
when operating on large pieces of text all at once, e.g. a large file or
some other text-based data such as MIME or XML data.
A comment on your benchmark: doesn't it make more sense to do the tests
on new strings each time?

For the destructive methods, definitely. For the non-destructive
methods, it's irrelevant.
Anyway, if you don't like this test, here are the times on my machine
(it seems Ruby on a Linux system runs a lot faster than on a Windows
system, and much more evenly between the two versions of each method):
I'm not arguing with the fact that for most methods, the destructive
version will be faster. I'm just saying that the performance difference
between the two isn't that big,

I still see a lot of cases in your results where the destructive method
is roughly twice as fast. That is significant.
 

Nikolai Weibull

* Glenn Parker (Mar 20, 2005 13:40):
Seems to me that the same logic should apply to gsub vs. gsub!, but it
doesn't quite work out that way. A gsub benchmark using different
lengths of input and output might be more appropriate.

Hm, yeah, that seems reasonable.
An aside: I do not accept the assertion that Strings should only be
used when they can be kept short. I still expect reasonably good
performance when operating on large pieces of text all at once, e.g. a
large file or some other text-based data such as MIME or XML data.

Well, the problem is that a String is represented by one long sequence
of bytes in memory. That will be slow no matter what you do if the
String is very large. (Well, slow may be an overstatement, but it will
become the dominating factor.)
For the destructive methods, definitely. For the non-destructive
methods, it's irrelevant.
Precisely.
I still see a lot of cases in your results where the destructive
method is roughly twice as fast. That is significant.

Not really. It only plays a role at very large iteration counts, and I'd
say that by then you'd probably be better off optimizing something else,
nikolai
 

Glenn Parker

Nikolai said:
Well, the problem is that a String is represented by one long sequence
of bytes in memory.

AFAIK, that is not actually a requirement for a Ruby implementation.
 

Nikolai Weibull

* Glenn Parker (Mar 20, 2005 16:00):
AFAIK, that is not actually a requirement for a Ruby implementation.

What does that mean? That is how it's implemented in Ruby. There's no
definition of how Ruby should be implemented. The implementation is the
standard. That isn't to say that it's the final word on how a String
should be represented, but if, in a future release, it is represented
in some other way, perhaps having destructive versus
non-destructive methods will make no sense,
nikolai
 

Glenn Parker

Nikolai said:
* Glenn Parker (Mar 20, 2005 16:00):
AFAIK, that is not actually a requirement for a Ruby implementation.

What does that mean? That is how it's implemented in Ruby. There's no
definition of how Ruby should be implemented.

Sure there is: a Ruby implementation should implement the Ruby language
as currently defined. What I'm saying is that there is nothing in the
definition of the Ruby language that forces an implementor to use
contiguous blocks of memory for strings or arrays. Managing large
strings as linked chunks is something text editors already do for
performance. Assertions about how strings should or should not be used
are based, in part, on assumptions that should be questioned.
The implementation is the standard.

I would rather say:
The most widely used implementation is the de facto standard. :)
That isn't to say that it's the final word on how a String
should be represented, but if, in a future release, it is represented
in some other way, perhaps having destructive versus
non-destructive methods will make no sense,

I think what I'm suggesting would actually make destructive methods more
appealing, without reducing the object creation overhead for
non-destructive methods.
 

Nikolai Weibull

* Glenn Parker (Mar 20, 2005 17:10):
Sure there is: a Ruby implementation should implement the Ruby
language as currently defined. What I'm saying is that there is
nothing in the definition of the Ruby language that forces an
implementor to use contiguous blocks of memory for strings or arrays.

Well, yes there is. You are forgetting the C interface:
RSTRING(str)->ptr is used in a lot of modules.
Managing large strings as linked chunks is something text editors
already do for performance.

It's not a very good representation for a text editor, though. The
lookup method becomes O(n), which is just terrible. Read [1] for an
overview of representing sequences. I have already mentioned Ropes a
couple of times as a possible solution to the problem of dealing with
large strings.
Assertions about how strings should or should not be used are based,
in part, on assumptions that should be questioned.

Yes, but the whole point is that they haven't been, and it seems that
people don't realize how the implementation limits their use. A String
is simply not well-suited for sequences longer than a certain limit
(depending on the system and so on). There are ways to alleviate these
issues. For me, the most obviously good solution would be to add
another sequence data structure that would fit use-cases where long
sequences are expected.
I would rather say: The most widely used implementation is the de
facto standard. :)

Which, in this case, is like saying what I already said.
I think what I'm suggesting would actually make destructive methods
more appealing, without reducing the object creation overhead for
non-destructive methods.

I don't see how. One of the gains of breaking the string up is that
parts of it can be reused in strings based on the original, e.g.,
"abc" + "def" can be implemented as an O(1)-operation where we simply
create a string that points to the two substrings. slice can be
implemented by pointing into the string we are taking a slice of and
slice! can be implemented by breaking the string up and pointing to the
two parts of the string that remain after the deletion. Implementing
slice and slice! in this manner means that they will be equally
efficient, both time- and space-wise.
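
As a toy sketch of that sharing idea (nothing like Ruby's actual String,
just an illustration; the class name and threshold of usefulness are made up):

# A toy concatenation node: "abc" + "def" becomes an O(1) operation
# because neither operand is copied, only referenced.
class Cat
  attr_reader :left, :right, :length

  def initialize(left, right)
    @left, @right = left, right
    @length = left.length + right.length
  end

  # Index lookup now costs O(depth) instead of O(1); that is the
  # price paid for cheap concatenation and substring sharing.
  def [](i)
    i < @left.length ? @left[i] : @right[i - @left.length]
  end

  def to_s
    @left.to_s + @right.to_s
  end
end

ab = Cat.new("abc", "def")  # no bytes copied here
ab.length                   # => 6
ab.to_s                     # => "abcdef"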

See [1] and [2] for further discussions on the ideas I have presented
here.

A representation of a sequence as a sequence of contiguous bytes in
memory is very efficient for lookups and it's space optimal. Most
operations will perform very efficiently for shorter sequences as well,
but degrade horribly for long sequences. I am not advocating that the
"sequence of contiguous bytes" implementation for String be replaced; it
fits very well with many use-cases. What I am advocating is that we
should realize that it doesn't fit every case and that a second sequence
data structure should be added for dealing with the cases where a String
isn't optimal. I feel that most of the destructive methods on Strings
are optimizing the wrong problem. It's not excessive copying that
should be avoided, but excessive computation.

My solution will only solve the destructive vs. non-destructive problem
for those methods that deal with substrings of the original string, e.g.,
concat, chomp, delete, gsub, lstrip, rstrip, slice, strip, sub. For
capitalize, downcase, swapcase, tr, tr_s, and upcase, one would still
have to create a new copy. I wonder, though, how often the difference
between capitalize and capitalize! in execution speed actually makes a
difference,
nikolai

[1] http://citeseer.ist.psu.edu/crowley98data.html
[2] http://rubyurl.com/rubyurl/show/2FRbO (PDF)
 

ES

Forgot to respond to the previous post.
Apologies on my part, as well.

On 3/20/2005, Nikolai Weibull wrote:
* Glenn Parker (Mar 20, 2005 17:10):
Sure there is: a Ruby implementation should implement the Ruby
language as currently defined. What I'm saying is that there is
nothing in the definition of the Ruby language that forces an
implementor to use contiguous blocks of memory for strings or arrays.
Well, yes there is. You are forgetting the C interface:
RSTRING(str)->ptr is used in a lot of modules.

YARV could be considered an implementation of Ruby, so it's
not _necessarily_ the same, although this is just a semantic
point.
Managing large strings as linked chunks is something text editors
already do for performance.

It's not a very good representation for a text editor, though. The
lookup method becomes O(n), which is just terrible. Read [1] for an
overview of representing sequences. I have already mentioned Ropes a
couple of times as a possible solution to the problem of dealing with
large strings.
Assertions about how strings should or should not be used are based,
in part, on assumptions that should be questioned.

Yes, but the whole point is that they haven't been, and it seems that
people don't realize how the implementation limits their use. A String
is simply not well-suited for sequences longer than a certain limit
(depending on the system and so on). There are ways to alleviate these
issues. For me, the most obviously good solution would be to add
another sequence data structure that would fit use-cases where long
sequences are expected.

Ropes are used fairly widely and would be a good solution to this
problem. In addition, I see no reason why Strings and Ropes could
not use the same idiom as Fixnum and Bignum, where the former is
automatically converted (but the latter may also be explicitly
instantiated).
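
(A toy sketch of that promotion idea, purely illustrative; nothing like
this exists in Ruby, Rope here is a stand-in class and the threshold is
arbitrary:)

# Stand-in for a real rope: it just keeps the two pieces without copying them.
class Rope
  def initialize(a, b)
    @parts = [a, b]
  end

  def length
    @parts.inject(0) { |n, part| n + part.length }
  end

  def to_s
    @parts.join
  end
end

ROPE_THRESHOLD = 4096  # arbitrary cut-off, analogous to the Fixnum/Bignum boundary

# Small results stay ordinary Strings; large ones become Ropes,
# the way Fixnum arithmetic silently overflows into Bignum.
def promote_concat(a, b)
  a.length + b.length <= ROPE_THRESHOLD ? a + b : Rope.new(a, b)
end
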
I would rather say: The most widely used implementation is the de
facto standard. :)

Which, in this case, is like saying what I already said.
I think what I'm suggesting would actually make destructive methods
more appealing, without reducing the object creation overhead for
non-destructive methods.

I don't see how. One of the gains of breaking the string up is that
parts of it can be reused in strings based on the original, e.g.,
"abc" + "def" can be implemented as an O(1)-operation where we simply
create a string that points to the two substrings. slice can be
implemented by pointing into the string we are taking a slice of and
slice! can be implemented by breaking the string up and pointing to the
two parts of the string that remain after the deletion. Implementing
slice and slice! in this manner means that they will be equally
efficient, both time- and space-wise.

See [1] and [2] for further discussions on the ideas I have presented
here.

A representation of a sequence as a sequence of contiguous bytes in
memory is very efficient for lookups and it's space optimal. Most
operations will perform very efficiently for shorter sequences as well,
but degrade horribly for long sequences. I am not advocating that the
"sequence of contiguous bytes" implementation for String be replaced; it
fits very well with many use-cases. What I am advocating is that we
should realize that it doesn't fit every case and that a second sequence
data structure should be added for dealing with the cases where a String
isn't optimal. I feel that most of the destructive methods on Strings
are optimizing the wrong problem. It's not excessive copying that
should be avoided, but excessive computation.

My solution will only solve the destructive vs. non-destructive problem
for those methods that deal with substrings of the original string, e.g.,
concat, chomp, delete, gsub, lstrip, rstrip, slice, strip, sub. For
capitalize, downcase, swapcase, tr, tr_s, and upcase, one would still
have to create a new copy. I wonder, though, how often the difference
between capitalize and capitalize! in execution speed actually makes a
difference,

Interesting... I certainly see the validity of your call to optimize
the non-destructive operations; I'm sure the performance can be
reasonably close to that of the destructive versions when properly
designed. However, it will still
have an impact in _some_ programs even when we're dealing with these
minimal differences.

And I still assert that it's conceptually better to s.chomp! than
to s = s.chomp :)

E
 
