String.split

Tom Danielsen · Jul 14, 2004

While it works as documented in Pickaxe ("If pattern is omitted, the
value of $; is used. If $; is nil (which is the default), str is split
on whitespace as if ` ' were specified."), I do find this behaviour
somewhat surprising:

irb(main):004:0> "a b".split(" ")
=> ["a", "b"]
irb(main):005:0> "a\tb".split(" ")
=> ["a", "b"]
irb(main):006:0> "a b".split(/ /)
=> ["a", "b"]
irb(main):007:0> "a\tb".split(/ /)
=> ["a\tb"]
irb(main):008:0>

I think "a\tb".split(" ") => ["a", "b"] is quite counterintuitive...

% ruby -v
ruby 1.8.1 (2003-12-25) [i386-freebsd5.1]
%

regards,
Tom

Lloyd Zusman · Jul 14, 2004

Tom Danielsen said:
While it works as documented in Pickaxe ("If pattern is omitted, the
value of $; is used. If $; is nil (which is the default), str is split
on whitespace as if ` ' were specified."), I do find this behaviour
somewhat surprising:

irb(main):004:0> "a b".split(" ")
=> ["a", "b"]
irb(main):005:0> "a\tb".split(" ")
=> ["a", "b"]
irb(main):006:0> "a b".split(/ /)
=> ["a", "b"]
irb(main):007:0> "a\tb".split(/ /)
=> ["a\tb"]
irb(main):008:0>

I think "a\tb".split(" ") => ["a", "b"] is quite counterintuitive...

This case follows the convention in Perl, where a split pattern of " "
(one explicit space, not in the form of a regexp) is a special case
which means to split on any occurrence of one or more whitespace
characters, ignoring any leading whitespace.

We also have this:

irb(main):001:0> " a b".split(" ")
=> ["a", "b"]
irb(main):002:0> " a b".split(/ /)
=> ["", "a", "b"]
irb(main):003:0> "\ta\tb".split(" ")
=> ["a", "b"]
irb(main):004:0> "\ta\tb".split(/\t/)
=> ["", "a", "b"]

It's a common occurrence to want to split lines that have fields
separated by arbitrary whitespace characters, and to ignore any leading
whitespace. This usage of split() does that quite nicely.
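
As a minimal sketch of that use case (the helper name parse_fields and
the sample line are made up for illustration, not taken from this
thread), splitting a whitespace-delimited record might look like this:

```ruby
# Hypothetical helper: break one record into fields on runs of whitespace,
# ignoring leading and trailing whitespace (the split(" ") special case).
def parse_fields(line)
  line.split(" ")
end

# The sample input is invented for this example.
p parse_fields("  alice\t42   admin\n")   # prints ["alice", "42", "admin"]
```

The trailing newline does not produce an extra field, so there is no
need to chomp the line first.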

This convention was almost certainly adopted deliberately, in order to
be consistent with some of the semantics of Perl's split() function.
Although it may seem counter-intuitive to people without prior Perl
experience, it's a very familiar construct for those who have been
working in Perl for a long time.

Just think of split(" ") as a special case which performs a very useful
function.

Gavin Kistner · Jul 14, 2004

This convention was almost certainly adopted deliberately, in order to
be consistent with some of the semantics of Perl's split() function.
Although it may seem counter-intuitive to people without prior Perl
experience, it's a very familiar construct for those who have been
working in Perl for a long time.

Just think of split(" ") as a special case which performs a very useful
function.

Ick.

Not at your summary, Lloyd, but at this situation. This is...stupid.
I don't know what else to call it.

It's a nonsensical idiom, sure to bite more than a few people. It's
like Ruby implemented the behavior of a bug that Perl people have
gotten used to relying on.

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

It is counter-intuitive to people without prior Perl experience. Now
that Ruby is taking off in its own right, does Ruby need to continue
supporting gross global $ vars, this, and other ugly Perl-isms just to
try and make Ruby feel more like Perl?

Cameron McBride · Jul 14, 2004

What possible benefit is there to typing split(" ") vs. split(/\s/)?

One saved character (but two shift key presses!)?

They are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> ["this", "is", "fun", "no?"]
irb(main):003:0> s.split(" ")
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]

It is counter-intuitive to people without prior Perl experience. Now
that Ruby is taking off in its own right, does Ruby need to continue
supporting gross global $ vars, this, and other ugly Perl-isms just to
try and make Ruby feel more like Perl?

Well, things are the way they are. Ruby has over 10 years behind it.
I, for one, would like to see fewer sweeping changes that cause
breakage, not more.

Cameron

Cameron McBride · Jul 14, 2004

Stupid web interface. The paste got mangled. Apologies.

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

Cameron

Bill Kelly · Jul 14, 2004

Hi,

Gavin Kistner said:
Ick.

Not at your summary, Lloyd, but at this situation. This is...stupid.
I don't know what else to call it.

It's a nonsensical idiom, sure to bite more than a few people. It's
like Ruby implemented the behavior of a bug that Perl people have
gotten used to relying on.

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

They aren't the same. I agree that having a special case
feels funky... But split(" ") embodies functionality that's
not as easy to duplicate with a regexp like /\s/. For instance:
=> ["a", "b", "c"]
=> ["", "", "", "a", "", "", "", "b", "", "", "", "c"]
=> ["", "a", "b", "c"]

Even with /\s+/ we're getting a leading empty field that
the " " special case eliminates for us.
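
The inputs that produced the three results above did not survive in the
post. The following is only a reconstruction: the string (three leading
spaces, four spaces between letters) is an assumption chosen because it
reproduces those three results with split(" "), split(/\s/) and
split(/\s+/). The names SAMPLE and split_three_ways are hypothetical.

```ruby
# Assumed input: 3 leading spaces and 4 spaces between the letters.
# This is a guess that matches the quoted results, not the original string.
SAMPLE = "   a    b    c"

def split_three_ways(s)
  {
    awk_style: s.split(" "),    # the " " special case
    single_ws: s.split(/\s/),   # one field boundary per whitespace character
    ws_run:    s.split(/\s+/),  # one boundary per run; the leading "" is kept
  }
end

split_three_ways(SAMPLE).each do |name, fields|
  puts "#{name}: #{fields.inspect}"
end
```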

I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...
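
One possible way around the lookbehind problem is to match the fields
rather than the separators. A small sketch, assuming the input only
contains ordinary ASCII whitespace; the method name fields_by_scan is
hypothetical:

```ruby
# Collect every run of non-whitespace characters instead of splitting
# on the whitespace between them.
def fields_by_scan(str)
  str.scan(/\S+/)
end

p fields_by_scan("  spaces of\tdoom ")   # prints ["spaces", "of", "doom"]
```

This is not a pattern for split itself, so it sidesteps the question
rather than answering it, but for inputs like the ones in this thread
it should yield the same fields as split(" ").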

It is counter-intuitive to people without prior Perl experience. Now
that Ruby is taking off in its own right, does Ruby need to continue
supporting gross global $ vars, this, and other ugly Perl-isms just to
try and make Ruby feel more like Perl?

Some are Perl-isms, some are Shell-isms. They're fantastic
for one-liners... If Ruby was neutered to be lousy for one-
liners, I'd be thoroughly bummed . . .

Regards,

Bill

Chris Dutton · Jul 14, 2004

Bill said:
I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

Not that I dislike the behavior of split(" "), but it shouldn't be much
harder than:

" a b c d ".strip.split(/\s+/)
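
A quick check of that suggestion against the special case, as a sketch
only; the helper name strip_then_split is made up here:

```ruby
# Remove surrounding whitespace first, then split on runs of whitespace.
def strip_then_split(str)
  str.strip.split(/\s+/)
end

sample = " a b c d "
p strip_then_split(sample)                       # prints ["a", "b", "c", "d"]
p strip_then_split(sample) == sample.split(" ")  # prints true
```

Because split drops trailing empty fields when no limit is given, only
the leading whitespace actually needs removing, so lstrip would
probably do as well.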

Mark Hubbart · Jul 14, 2004

Even with /\s+/ we're getting a leading empty field that
the " " special case eliminates for us.

I've never been sure how to write a regexp for split that
does what " " does. I keep thinking it'd need a variable-
width negative lookbehind assertion... which I don't think
even Perl's regex engine supports... Something like:

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to
nil, it uses this kind of split anyway.

And it's no more characters to type than " "

Mark

Robert Klemme · Jul 14, 2004

Cameron McBride said:
What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. IMHO regular expressions are better suited to
this task anyway. And they are faster:

def test1(s) s.split(' ') end

def test2(s) s.split(/\s+/) end

foo = (1..100).to_a.join " "

1000.times { test1 foo }
1000.times { test2 foo }

Yields

11:12:47 [source]: /c/temp/split-perf.rb
     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 61.58       0.62      0.62     2000      0.31      0.31  String#split
 15.37       0.78      0.16     1000      0.16      0.39  Object#test1
 13.79       0.92      0.14        2     70.00    507.50  Integer#times
  9.26       1.02      0.09     1000      0.09      0.48  Object#test2
  3.05       1.05      0.03        1     31.00     31.00  Profiler__.start_profile
  0.00       1.05      0.00        2      0.00      0.00  Module#method_added
  0.00       1.05      0.00      100      0.00      0.00  Fixnum#to_s
  0.00       1.05      0.00        1      0.00      0.00  Enumerable.to_a
  0.00       1.05      0.00        1      0.00   1015.00  #toplevel
  0.00       1.05      0.00        1      0.00      0.00  Array#join
  0.00       1.05      0.00        1      0.00      0.00  Range#each

Which shows that the regexp version is faster. I assume, the string is
converted into a regexp internally and that this is done on each
invocation, while there are definitely optimizations for recurring regexp
usage.
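
For anyone wanting to repeat the comparison without the profiler (whose
per-call bookkeeping can distort small differences), here is a rough
timing sketch. The method names and the time_it helper are hypothetical,
the input mirrors the one above, and the numbers will vary by machine
and Ruby version, so no particular result is claimed.

```ruby
def split_on_string(s)
  s.split(' ')
end

def split_on_regexp(s)
  s.split(/\s+/)
end

# Wall-clock seconds taken to run the given block `iterations` times.
def time_it(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

foo = (1..100).to_a.join(' ')
string_secs = time_it(1000) { split_on_string(foo) }
regexp_secs = time_it(1000) { split_on_regexp(foo) }
puts format('split(" "):   %.4fs', string_secs)
puts format('split(/\s+/): %.4fs', regexp_secs)
```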

Regards

robert

Lloyd Zusman · Jul 14, 2004

Mark Hubbart said:
[ ... ]

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to
nil, it uses this kind of split anyway.

And it's no more characters to type than " "

Mark

... and the following has even fewer characters to type:

irb(main):001:0> " spaces of doom ".split()
=> ["spaces", "of", "doom"]

Lloyd Zusman · Jul 14, 2004

Robert Klemme said:
Cameron McBride said:

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. [ ... ]

However, the two cases are not equivalent:

irb(main):001:0> " spaces of doom ".split(/\s+/)
=> ["", "spaces", "of", "doom"]
irb(main):002:0> " spaces of doom ".split(" ")
=> ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.

Robert Klemme · Jul 14, 2004

Lloyd Zusman said:
Robert Klemme said:

Cameron McBride said:

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. [ ... ]

However, the two cases are not equivalent:

irb(main):001:0> " spaces of doom ".split(/\s+/)
=> ["", "spaces", "of", "doom"]
irb(main):002:0> " spaces of doom ".split(" ")
=> ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.

You're right, the strip makes

def test1(s) s.split(' ') end
def test2(s) s.split(/\s+/) end
def test3(s) s.strip.split(/\s+/) end
def test4(s) s.sub(/^\s+/, '').split(/\s+/) end

foo = (1..100).to_a.join " "

1000.times { test1 foo }
1000.times { test2 foo }
1000.times { test3 foo }
1000.times { test4 foo }

12:26:30 [ruby]: ./split-perf.rb
     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 56.61       1.36      1.36     4000      0.34      0.34  String#split
 10.56       1.62      0.25        4     63.50    597.75  Integer#times
  8.89       1.83      0.21     1000      0.21      0.21  String#sub
  8.40       2.03      0.20     1000      0.20      0.55  Object#test2
  7.15       2.20      0.17     1000      0.17      0.65  Object#test4
  5.15       2.33      0.12     1000      0.12      0.39  Object#test1
  2.62       2.39      0.06     1000      0.06      0.55  Object#test3
  1.29       2.42      0.03        1     31.00     31.00  Profiler__.start_profile
  0.00       2.42      0.00        1      0.00   2406.00  #toplevel
  0.00       2.42      0.00      100      0.00      0.00  Fixnum#to_s
  0.00       2.42      0.00     1000      0.00      0.00  String#strip
  0.00       2.42      0.00        4      0.00      0.00  Module#method_added
  0.00       2.42      0.00        1      0.00      0.00  Range#each
  0.00       2.42      0.00        1      0.00      0.00  Array#join
  0.00       2.42      0.00        1      0.00      0.00  Enumerable.to_a

def test1(s) s.split(' ') end
def test2(s) s.split(/\s+/) end
def test3(s) s.strip.split(/\s+/) end
def test4(s) s.sub(/^\s+/, '').split(/\s+/) end

foo = " " + (1..100).to_a.join( " " )

1000.times { test1 foo }
1000.times { test2 foo }
1000.times { test3 foo }
1000.times { test4 foo }

12:27:36 [ruby]: ./split-perf.rb
     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 51.03       1.26      1.26     4000      0.32      0.32  String#split
 12.84       1.58      0.32     1000      0.32      0.65  Object#test3
 10.13       1.83      0.25        4     62.50    613.50  Integer#times
  8.30       2.03      0.20     1000      0.20      0.39  Object#test1
  7.09       2.21      0.17     1000      0.17      0.64  Object#test4
  6.97       2.38      0.17     1000      0.17      0.52  Object#test2
  1.82       2.42      0.05     1000      0.05      0.05  String#sub
  1.26       2.46      0.03        1     31.00     31.00  Profiler__.start_profile
  1.22       2.49      0.03     1000      0.03      0.03  String#strip
  0.61       2.50      0.01        1     15.00     15.00  Enumerable.to_a
  0.00       2.50      0.00        1      0.00      0.00  String#+
  0.00       2.50      0.00      100      0.00      0.00  Fixnum#to_s
  0.00       2.50      0.00        1      0.00   2469.00  #toplevel
  0.00       2.50      0.00        1      0.00      0.00  Range#each
  0.00       2.50      0.00        1      0.00      0.00  Array#join
  0.00       2.50      0.00        4      0.00      0.00  Module#method_added

Performance ranking depends on whether there are leading spaces or not.

robert

Lloyd Zusman · Jul 14, 2004

Robert Klemme said:
Lloyd Zusman said:

Robert Klemme said:

What possible benefit is there to typing split(" ") vs. split(/\s/)?
One saved character (but two shift key presses!)?

they are not the same:

irb(main):001:0> s = "this is\tfun \tno?"
=> "this is\tfun \tno?"
irb(main):002:0> s.split(/\s/)
=> ["this", "is", "fun", "", "no?"]
irb(main):003:0> s.split(" ")
=> ["this", "is", "fun", "no?"]

I'd rather compare split(" ") to split(/\s+/), which is what I use when I
need this functionality. [ ... ]

However, the two cases are not equivalent:

irb(main):001:0> " spaces of doom ".split(/\s+/)
=> ["", "spaces", "of", "doom"]
irb(main):002:0> " spaces of doom ".split(" ")
=> ["spaces", "of", "doom"]

You'd have to compare split(" ") with strip.split(/\s+/). I'll do that
later this morning, when I have more time, and I'll then post my
results.

You're right, the strip makes

[ ... etc. ... ]

Well, you saved me some time by running these yourself. Thanks.

Hmm ... if you know for sure ahead of time whether or not there's
leading whitespace, split(' ') is not the best.

However, without this knowledge about the existence of leading
whitespace or lack thereof, I believe that the best bet is still
split(' ') and its cousins split(nil) and split().

Using a random number of spaces between the items and a random amount of
leading whitespace (including none), I got the following results. Note
that the split(' ')/split(nil)/split() cases are the fastest ones when
you leave out the split(/\s+/) case. That one should really be left out
of these random whitespace tests, because it doesn't give the same
results as the others.

testArray = []

1000.times {
  string = ''
  (1..100).each { |x| string += ((" " * rand(3)) + x.to_s) }
  testArray << string
}

require 'profile'

def test1(s) s.split(' ') end
def test2(s) s.split(nil) end
def test3(s) s.split() end
def test4(s) s.split(/\s+/) end
def test5(s) s.strip.split(/\s+/) end
def test6(s) s.sub(/^\s+/, '').split(/\s+/) end

testArray.each { |x| test1(x) }
testArray.each { |x| test2(x) }
testArray.each { |x| test3(x) }
testArray.each { |x| test4(x) }
testArray.each { |x| test5(x) }
testArray.each { |x| test6(x) }

     % cumulative      self               self     total
  time    seconds   seconds    calls   ms/call   ms/call  name
 33.17       3.80      3.80     6000      0.63      0.63  String#split
 31.54       7.42      3.62        1   3617.19   3617.19  Profiler__.start_profile
 24.59      10.24      2.82        6    470.05   1911.46  Array#each
  8.17      11.18      0.94     1000      0.94      1.66  Object#test5
  6.68      11.95      0.77     1000      0.77      1.93  Object#test6
  6.34      12.67      0.73     1000      0.73      1.20  Object#test1
  6.27      13.39      0.72     1000      0.72      1.60  Object#test4
  6.27      14.11      0.72     1000      0.72      1.13  Object#test3
  5.18      14.70      0.59     1000      0.59      1.12  Object#test2
  2.18      14.95      0.25     1000      0.25      0.25  String#sub
  1.16      15.09      0.13     1000      0.13      0.13  String#strip
  0.00      15.09      0.00        6      0.00      0.00  Module#method_added
  0.00      15.09      0.00        1      0.00  11468.75  #toplevel
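
To see which of the six variants actually agree on input like this, a
correctness check along the following lines might help. It is a sketch
under assumptions: $; is left at its default of nil, the seed is
arbitrary, and the names random_line, split_variants and
disagreeing_variants are invented for the example.

```ruby
# Build a line the way the test above does: 0 to 2 spaces before each number.
def random_line(rng)
  (1..100).map { |x| (" " * rng.rand(3)) + x.to_s }.join
end

# The six variants from the profile run, keyed by a descriptive label.
def split_variants(s)
  {
    string:       s.split(' '),
    nil_arg:      s.split(nil),
    no_arg:       s.split,
    regexp:       s.split(/\s+/),
    strip_regexp: s.strip.split(/\s+/),
    sub_regexp:   s.sub(/^\s+/, '').split(/\s+/),
  }
end

# Labels of the variants whose result differs from split(' ').
def disagreeing_variants(s)
  reference = s.split(' ')
  split_variants(s).reject { |_name, fields| fields == reference }.keys
end

rng = Random.new(2004)   # arbitrary seed so that runs are repeatable
counts = Hash.new(0)
1000.times do
  disagreeing_variants(random_line(rng)).each { |name| counts[name] += 1 }
end
p counts   # how often each variant disagreed with split(' ')
```

Only the bare split(/\s+/) variant should ever show up in the tally, and
only for lines that happen to start with whitespace.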

Mark Hubbart · Jul 14, 2004

Lloyd Zusman said:
Mark Hubbart said:

[ ... ]

/(?<!^\s+)\s+/ ...uh....

...Maybe there's another way to do it... If anybody knows
I'd like to learn...

as of now, this works:

irb(main):001:0> " spaces of doom ".split(nil)
=> ["spaces", "of", "doom"]

Why shouldn't nil be the only special case? If the $variable is set to
nil, it uses this kind of split anyway.

And it's no more characters to type than " "

Mark

... and the following has even fewer characters to type:

irb(main):001:0> " spaces of doom ".split()
=> ["spaces", "of", "doom"]

... which is the same as:

irb(main):001:0> " spaces of doom ".split
=> ["spaces", "of", "doom"]

However, I was mistakenly thinking that #split(nil) would be exactly
the same as #split(" ")... but it isn't. I tried setting $; to "." and
#split(nil) no longer split on whitespace. It seems that it should,
though: when you want the default behavior of #split, you set $; to nil,
so it seems rather logical that #split(nil) should split using that
default behavior. Oh well.
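
A small sketch of that difference, for anyone who wants to try it. The
helper with_field_separator is hypothetical and only exists to restore
$; afterwards; some newer Ruby versions may print a warning when $; is
set to a non-nil value.

```ruby
# Run the block with $; temporarily set to `separator`, then restore it.
def with_field_separator(separator)
  saved = $;
  $; = separator
  yield
ensure
  $; = saved
end

with_field_separator(".") do
  p "a.b c".split        # prints ["a", "b c"]   ($; is used)
  p "a.b c".split(nil)   # prints ["a", "b c"]   (nil falls back to $;)
  p "a.b c".split(" ")   # prints ["a.b", "c"]   (whitespace special case)
end
```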

cheers,
Mark
