String.split

Tom Danielsen

While it works as documented in Pickaxe ( If pattern is omitted, the
value of $; is used. If $; is nil (which is the default), str is split
on whitespace as if ` ' were specified. ) I do find this behaviour
somewhat surprising:

irb(main):004:0> "a b".split(" ")
=> ["a", "b"]
irb(main):005:0> "a\tb".split(" ")
=> ["a", "b"]
irb(main):006:0> "a b".split(/ /)
=> ["a", "b"]
irb(main):007:0> "a\tb".split(/ /)
=> ["a\tb"]
irb(main):008:0>

I think "a\tb".split(" ") => ["a", "b"] is quite counterintuitive...

% ruby -v
ruby 1.8.1 (2003-12-25) [i386-freebsd5.1]
%

regards,
Tom
 
Lloyd Zusman

Tom Danielsen said:
While it works as documented in Pickaxe ( If pattern is omitted, the
value of $; is used. If $; is nil (which is the default), str is split
on whitespace as if ` ' were specified. ) I do find this behaviour
somewhat surprising:

irb(main):004:0> "a b".split(" ")
=> ["a", "b"]
irb(main):005:0> "a\tb".split(" ")
=> ["a", "b"]
irb(main):006:0> "a b".split(/ /)
=> ["a", "b"]
irb(main):007:0> "a\tb".split(/ /)
=> ["a\tb"]
irb(main):008:0>

I think "a\tb".split(" ") => ["a", "b"] is quite counterintuitive...

This case follows the convention in Perl, where a split pattern of " "
(one explicit space, not in the form of a regexp) is a special case
which means to split on any occurrence of one or more whitespace
characters, ignoring any leading whitespace.

We also have this:

irb(main):001:0> " a b".split(" ")
=> ["a", "b"]
irb(main):002:0> " a b".split(/ /)
=> ["", "a", "b"]
irb(main):003:0> "\ta\tb".split(" ")
=> ["a", "b"]
irb(main):004:0> "\ta\tb".split(/\t/)
=> ["", "a", "b"]

It's a common occurrence to want to split lines that have fields
separated by arbitrary whitespace characters, and to ignore any leading
whitespace. This usage of split() does that quite nicely.
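
As a rough illustration of that use case, here is a minimal sketch that parses whitespace-separated fields. The sample lines and the `parse_fields` helper are invented for this example and do not come from any real data.

```ruby
# Hypothetical helper: split a line into fields on runs of whitespace,
# ignoring any leading whitespace (the split(" ") special case).
def parse_fields(line)
  line.split(" ")
end

# Made-up sample input mixing spaces and tabs, with leading whitespace.
lines = [
  "alice  42\tadmin",
  "   bob\t7   user",
  "\tcarol 19 user"
]

rows = lines.map { |line| parse_fields(line) }
rows.each { |row| p row }
```

Each row should come back as three clean fields, regardless of how the columns were padded.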

This convention was almost certainly adopted deliberately, in order to
be consistent with some of the semantics of Perl's split() function.
Although it may seem counter-intuitive to people without prior Perl
experience, it's a very familiar construct for those who have been
working in Perl for a long time.

Just think of split(" ") as a special case which performs a very useful
function.

 
Gavin Kistner

This convention was almost certainly adopted deliberately, in order to
be consistent with some of the semantics of Perl's split() function.
Although it may seem counter-intuitive to people without prior Perl
experience, it's a very familiar construct for those who have been
working in Perl for a long time.

Just think of split(" ") as a special case which performs a very useful
function.

Ick.

Not at your summary, Lloyd, but at this situation. This is...stupid.
I don't know what else to call it.

It's a non-sensical idiom, sure to bite more than a few people. It's
like Ruby implemented the behavior of a bug that Perl people have
gotten used to relying on.

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

It is counter-intuitive to people without prior Perl experience. Now
that Ruby is taking off in its own right, does Ruby need to continue
supporting gross global $ vars, this, and other ugly Perl-isms just to
try and make Ruby feel more like Perl?
 
Cameron McBride

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> ["this", "is", "fun", "no?"]
irb(main):003:0> s.split(" ")
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]

It is counter-intuitive to people without prior Perl experience. Now
that Ruby is taking off in its own right, does Ruby need to continue
supporting gross global $ vars, this, and other ugly Perl-isms just to
try and make Ruby feel more like Perl?

Well, things are the way they are. Ruby has over 10 years behind it.
I, for one, would like to see fewer sweeping changes that cause
breakage, not more.

Cameron
 
Cameron McBride

Stupid web interface. The paste got mangled. Apologies.

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]


Cameron
 
Bill Kelly

Hi,

Gavin Kistner said:
Ick.

Not at your summary, Lloyd, but at this situation. This is...stupid.
I don't know what else to call it.

It's a non-sensical idiom, sure to bite more than a few people. It's
like Ruby implemented the behavior of a bug that Perl people have
gotten used to relying on.

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

They aren't the same. I agree that having a special case
feels funky... But split(" ") embodies functionality that's
not as easy to duplicate as /\s/ . For instance:
=> ["a", "b", "c"]
=> ["", "", "", "a", "", "", "", "b", "", "", "", "c"]
=> ["", "a", "b", "c"]

Even with /\s+/ we're getting a leading empty field that
the " " special case eliminates for us.
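
The difference can be sketched on a made-up string. The value of `s` below is an assumption chosen for illustration, not the input from the original post.

```ruby
# Hypothetical input: leading whitespace plus runs of spaces between fields.
s = "  a  b  c"

awk_style = s.split(" ")    # special case: runs of whitespace, leading run ignored
single_ws = s.split(/\s/)   # every single whitespace character is a delimiter
run_of_ws = s.split(/\s+/)  # runs collapse, but the leading run yields ""

p awk_style  # => ["a", "b", "c"]
p single_ws  # => ["", "", "a", "", "b", "", "c"]
p run_of_ws  # => ["", "a", "b", "c"]
```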

I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

It is counter-intuitive to people without prior Perl experience. Now
that Ruby is taking off in its own right, does Ruby need to continue
supporting gross global $ vars, this, and other ugly Perl-isms just to
try and make Ruby feel more like Perl?

Some are Perl-isms, some are Shell-isms. They're fantastic
for one-liners... If Ruby was neutered to be lousy for one-
liners, I'd be thoroughly bummed . . .


Regards,

Bill
 
Chris Dutton

Bill said:
I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

Not that I dislike the behavior of split(" "), but it shouldn't be much
harder than:

" a b c d ".strip.split(/\s+/)
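
A quick check of that suggestion, as a sketch: on the made-up samples below, `strip.split(/\s+/)` gives the same fields as `split(" ")`. `scan(/\S+/)` is included as one more possible spelling. It is not mentioned in the thread and is offered only as an assumption-level alternative.

```ruby
# Made-up samples covering leading, trailing, and mixed whitespace.
samples = [" a b c d ", "\ta\t b", "a", "   ", ""]

results = samples.map do |s|
  {
    special:  s.split(" "),           # the special case under discussion
    stripped: s.strip.split(/\s+/),   # strip first, then split on runs
    scanned:  s.scan(/\S+/)           # collect runs of non-whitespace instead
  }
end

results.each { |r| p r }
```

The samples only use spaces and tabs, so this says nothing about more exotic whitespace.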
 
Mark Hubbart

Even with /\s+/ we're getting a leading empty field that
the " " special case eliminates for us.

I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to
nil, it uses this kind of split anyway.

And it's no more characters to type than " " ;)

Mark
 
Robert Klemme

Cameron McBride said:
What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. IMHO regular expressions are better suited to
this task anyway. And they are faster:

# Split on the special-case single-space string.
def test1(s) s.split(' ') end

# Split on an explicit whitespace regexp.
def test2(s) s.split(/\s+/) end

foo = (1..100).to_a.join(" ")

1000.times { test1(foo) }
1000.times { test2(foo) }

Yields

11:12:47 [source]: /c/temp/split-perf.rb
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 61.58     0.62      0.62     2000     0.31     0.31  String#split
 15.37     0.78      0.16     1000     0.16     0.39  Object#test1
 13.79     0.92      0.14        2    70.00   507.50  Integer#times
  9.26     1.02      0.09     1000     0.09     0.48  Object#test2
  3.05     1.05      0.03        1    31.00    31.00  Profiler__.start_profile
  0.00     1.05      0.00        2     0.00     0.00  Module#method_added
  0.00     1.05      0.00      100     0.00     0.00  Fixnum#to_s
  0.00     1.05      0.00        1     0.00     0.00  Enumerable.to_a
  0.00     1.05      0.00        1     0.00  1015.00  #toplevel
  0.00     1.05      0.00        1     0.00     0.00  Array#join
  0.00     1.05      0.00        1     0.00     0.00  Range#each

Which shows that the regexp version is faster. I assume the string is
converted into a regexp internally and that this is done on each
invocation, while there are definitely optimizations for recurring regexp
usage.

Regards

robert
 
Lloyd Zusman

Mark Hubbart said:
[ ... ]

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to
nil, it uses this kind of split anyway.

And it's no more characters to type than " " ;)

Mark

... and the following has even fewer characters to type:

irb(main):001:0> " spaces of doom ".split()
=> ["spaces", "of", "doom"]
 
Lloyd Zusman

Robert Klemme said:
Cameron McBride said:
What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. [ ... ]

However, the two cases are not equivalent:

irb(main):001:0> " spaces of doom ".split(/\s+/)
=> ["", "spaces", "of", "doom"]
irb(main):002:0> " spaces of doom ".split(" ")
=> ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.
 
Robert Klemme

Lloyd Zusman said:
Robert Klemme said:
Cameron McBride said:
What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. [ ... ]

However, the two cases are not equivalent:

irb(main):001:0> " spaces of doom ".split(/\s+/)
=> ["", "spaces", "of", "doom"]
irb(main):002:0> " spaces of doom ".split(" ")
=> ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.

You're right, the strip makes

def test1(s) s.split(' ') end
def test2(s) s.split(/\s+/) end
def test3(s) s.strip.split(/\s+/) end
def test4(s) s.sub(/^\s+/, '').split(/\s+/) end

foo = (1..100).to_a.join " "

1000.times { test1 foo }
1000.times { test2 foo }
1000.times { test3 foo }
1000.times { test4 foo }


12:26:30 [ruby]: ./split-perf.rb
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 56.61     1.36      1.36     4000     0.34     0.34  String#split
 10.56     1.62      0.25        4    63.50   597.75  Integer#times
  8.89     1.83      0.21     1000     0.21     0.21  String#sub
  8.40     2.03      0.20     1000     0.20     0.55  Object#test2
  7.15     2.20      0.17     1000     0.17     0.65  Object#test4
  5.15     2.33      0.12     1000     0.12     0.39  Object#test1
  2.62     2.39      0.06     1000     0.06     0.55  Object#test3
  1.29     2.42      0.03        1    31.00    31.00  Profiler__.start_profile
  0.00     2.42      0.00        1     0.00  2406.00  #toplevel
  0.00     2.42      0.00      100     0.00     0.00  Fixnum#to_s
  0.00     2.42      0.00     1000     0.00     0.00  String#strip
  0.00     2.42      0.00        4     0.00     0.00  Module#method_added
  0.00     2.42      0.00        1     0.00     0.00  Range#each
  0.00     2.42      0.00        1     0.00     0.00  Array#join
  0.00     2.42      0.00        1     0.00     0.00  Enumerable.to_a



def test1(s) s.split(' ') end
def test2(s) s.split(/\s+/) end
def test3(s) s.strip.split(/\s+/) end
def test4(s) s.sub(/^\s+/, '').split(/\s+/) end

foo = " " + (1..100).to_a.join( " " )

1000.times { test1 foo }
1000.times { test2 foo }
1000.times { test3 foo }
1000.times { test4 foo }


12:27:36 [ruby]: ./split-perf.rb
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 51.03     1.26      1.26     4000     0.32     0.32  String#split
 12.84     1.58      0.32     1000     0.32     0.65  Object#test3
 10.13     1.83      0.25        4    62.50   613.50  Integer#times
  8.30     2.03      0.20     1000     0.20     0.39  Object#test1
  7.09     2.21      0.17     1000     0.17     0.64  Object#test4
  6.97     2.38      0.17     1000     0.17     0.52  Object#test2
  1.82     2.42      0.05     1000     0.05     0.05  String#sub
  1.26     2.46      0.03        1    31.00    31.00  Profiler__.start_profile
  1.22     2.49      0.03     1000     0.03     0.03  String#strip
  0.61     2.50      0.01        1    15.00    15.00  Enumerable.to_a
  0.00     2.50      0.00        1     0.00     0.00  String#+
  0.00     2.50      0.00      100     0.00     0.00  Fixnum#to_s
  0.00     2.50      0.00        1     0.00  2469.00  #toplevel
  0.00     2.50      0.00        1     0.00     0.00  Range#each
  0.00     2.50      0.00        1     0.00     0.00  Array#join
  0.00     2.50      0.00        4     0.00     0.00  Module#method_added


Performance ranking depends on whether there are leading spaces or not.
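
For anyone who wants to repeat this comparison without the profiler, here is a rough sketch that times two of the variants with the monotonic clock. The `elapsed` helper, the `variants` table, and the three-space padding are assumptions made for illustration, and the numbers will differ by Ruby version and machine.

```ruby
# Hypothetical helper: run the block and return the elapsed seconds.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

plain  = (1..100).to_a.join(" ")
padded = "   " + plain   # same data with leading whitespace

# Two spellings that return the same fields on both inputs.
variants = {
  "split(' ')"         => ->(s) { s.split(" ") },
  'strip.split(/\s+/)' => ->(s) { s.strip.split(/\s+/) }
}

timings = {}
{ "plain" => plain, "padded" => padded }.each do |label, input|
  variants.each do |name, fn|
    timings[[label, name]] = elapsed { 1000.times { fn.call(input) } }
  end
end

timings.each do |(label, name), secs|
  printf("%-7s %-20s %.4fs\n", label, name, secs)
end
```

Wall-clock timings like these are noisy, so it is worth running the script several times before reading anything into the ranking.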

robert
 
Lloyd Zusman

Robert Klemme said:
Lloyd Zusman said:
Robert Klemme said:
What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. [ ... ]

However, the two cases are not equivalent:

irb(main):001:0> " spaces of doom ".split(/\s+/)
=> ["", "spaces", "of", "doom"]
irb(main):002:0> " spaces of doom ".split(" ")
=> ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.

You're right, the strip makes

[ ... etc. ... ]

Well, you saved me some time by running these yourself. Thanks.

Hmm ... if you know for sure ahead of time whether or not there's
leading whitespace, split(' ') is not the best.

However, without this knowledge about the existence of leading
whitespace or lack thereof, I believe that the best bet is still
split(' ') and its cousins split(nil) and split().

Using a random number of spaces between the items and a random amount of
leading whitespace (including none), I got the following results. Note
that the split(' ')/split(nil)/split() cases are the fastest ones when
you leave out the split(/\s+/) case. That one should really be left out
of these random whitespace tests, because it doesn't give the same
results as the others.

testArray = []

1000.times {
  string = ''
  (1..100).each { |x| string += ((" " * rand(3)) + x.to_s) }
  testArray << string
}

require 'profile'

def test1(s) s.split(' ') end
def test2(s) s.split(nil) end
def test3(s) s.split() end
def test4(s) s.split(/\s+/) end
def test5(s) s.strip.split(/\s+/) end
def test6(s) s.sub(/^\s+/, '').split(/\s+/) end

testArray.each { |x| test1(x) }
testArray.each { |x| test2(x) }
testArray.each { |x| test3(x) }
testArray.each { |x| test4(x) }
testArray.each { |x| test5(x) }
testArray.each { |x| test6(x) }

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 33.17     3.80      3.80     6000     0.63     0.63  String#split
 31.54     7.42      3.62        1  3617.19  3617.19  Profiler__.start_profile
 24.59    10.24      2.82        6   470.05  1911.46  Array#each
  8.17    11.18      0.94     1000     0.94     1.66  Object#test5
  6.68    11.95      0.77     1000     0.77     1.93  Object#test6
  6.34    12.67      0.73     1000     0.73     1.20  Object#test1
  6.27    13.39      0.72     1000     0.72     1.60  Object#test4
  6.27    14.11      0.72     1000     0.72     1.13  Object#test3
  5.18    14.70      0.59     1000     0.59     1.12  Object#test2
  2.18    14.95      0.25     1000     0.25     0.25  String#sub
  1.16    15.09      0.13     1000     0.13     0.13  String#strip
  0.00    15.09      0.00        6     0.00     0.00  Module#method_added
  0.00    15.09      0.00        1     0.00 11468.75  #toplevel
 
Mark Hubbart

Mark Hubbart said:
[ ... ]

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to
nil, it uses this kind of split anyway.

And it's no more characters to type than " " ;)

Mark

... and the following has even fewer characters to type:

irb(main):001:0> " spaces of doom ".split()
=> ["spaces", "of", "doom"]

... which is the same as:

irb(main):001:0> " spaces of doom ".split
=> ["spaces", "of", "doom"]

However, I was mistakenly thinking that #split(nil) would be exactly
the same as #split(" ")... but it isn't. I tried setting $; to "." and
it no longer worked. It seems that it should, though: when you want the
default behavior of #split, you set $; to nil. It seems rather logical
that #split(nil) should split using that default behavior. Oh well :)
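
For what it's worth, a small sketch of the default case. It assumes $; has not been changed from nil, and the variable name `forms` is just for illustration.

```ruby
s = " spaces of doom "

# With $; left at its default of nil, all four spellings use the
# whitespace-splitting behavior discussed above.
forms = [s.split, s.split(), s.split(nil), s.split(" ")]

p forms
```

Once $; is set to something else, the forms that fall back to it (no argument, or nil) are the ones that change, which is the difference described above.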

cheers,
Mark
 
