String#split and groups in the field separator RE

mortee · Nov 1, 2007

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

mortee

7stud -- · Nov 1, 2007

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

It was unexpected behavior for me when I ran into it using python's
regex split() function a few months ago. Since it works the same way in
both languages, I would guess it might be a universal regex trait.

Daniel Sheppard · Nov 1, 2007

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

It is curious that it's not in the api doc... I must have learnt it from
somewhere...

Dan.

7stud -- · Nov 1, 2007

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized sub groupings, the
result array will include the match for each subgroup--but not the whole
match.

Applying that rule to your examples:

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results.

irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???

The subgroup (:) matches one colon and those results are included. The
subgroup ((:)+) matches two, three, and four colons as it traverses the
string and those results are included. Because groups are numbered by
their leftmost parentheses, the outer grouping comes first in the list.

irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

The subgroup (:+) matches two, three, and four colons as it traverses
the string, and those matches are included in the results.

And, here is an example of my own that shows that the whole match is not
included in the results--only the parenthesized sub groupings are
included:

str = 'a_::_b_:::_c_::::_d'
pattern = /_(:+)_/

results = str.split(pattern)
p results

--output:--
["a", "::", "b", ":::", "c", "::::", "d"]
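The group ordering described above can be checked by looking at a single match directly. This is only a small sketch using the same string as the thread; the variable names are illustrative. It also shows why a repeated group such as (:)+ contributes just one colon: a group under a quantifier keeps only its last repetition.

```ruby
s = 'a::b:::c::::d'

# Look at the first separator match on its own.
# Group 1 is the outer (...), group 2 is the inner (:).
m = s.match(/((:)+)/)
whole    = m[0]        # the full separator text, "::"
captures = m.captures  # ["::", ":"] -- outer group first, inner group's last repetition second

# split returns the text between matches and, after each piece,
# the captures of the separator that followed it, in group order.
parts = s.split(/((:)+)/)

p whole
p captures
p parts
```

The whole match itself is never inserted by split; "::" shows up here only because group 1 happens to wrap the entire pattern.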

7stud -- · Nov 1, 2007

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

$ ri String#split
...
...
if _pattern_ is a +Regexp+, _str_ is divided where the pattern
matches. Whenever the pattern matches a zero-length string, _str_
is split into individual characters.
...

pickaxe2, p. 619 adds a line to the end of that description:

...
If pattern includes groups, these groups will be included in the
returned values.
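One place where the documented behaviour is handy rather than surprising is tokenizing while keeping the delimiters. The following is a minimal sketch with a made-up input string, not something taken from the Pickaxe text:

```ruby
expr = '1+2-3'

# Without a group the operators are discarded.
operands = expr.split(/[+-]/)

# With a capturing group each operator is returned between the operands.
tokens = expr.split(/([+-])/)

p operands  # => ["1", "2", "3"]
p tokens    # => ["1", "+", "2", "-", "3"]
```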

mortee · Nov 1, 2007

Daniel said:
Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

It is curious that it's not in the api doc... I must have learnt it from
somewhere...

7stud -- said:
I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized sub groupings, the
result array will include the match for each subgroup--but not the whole
match.

Applying that rule to your examples:

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results,

[...]

Thanks, that clarifies it, and the results make sense based on the rule.
However, I find it quite confusing to have parts of what I intend to be
part of the "separator" among the list of results. To say the least.

mortee

dan-ml · Nov 1, 2007

Daniel said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

/(?=,)/ is a lookahead match. I'm sure you really meant /(?:,)/
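
To make the difference concrete, here is a short comparison of the three forms on a sample string (the string and variable names are only for illustration):

```ruby
csv = 'a,b,c'

capturing     = csv.split(/(,)/)    # capturing group: the comma is returned too
non_capturing = csv.split(/(?:,)/)  # non-capturing group: the comma is dropped
lookahead     = csv.split(/(?=,)/)  # zero-width lookahead: splits before each comma, consumes nothing

p capturing      # => ["a", ",", "b", ",", "c"]
p non_capturing  # => ["a", "b", "c"]
p lookahead      # => ["a", ",b", ",c"]
```

With the lookahead, the comma is never part of the separator, so it stays attached to the field that follows it.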
