String#split and groups in the field separator RE

mortee

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

mortee
 
7stud --

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

It was unexpected behavior for me when I ran into it using python's
regex split() function a few months ago. Since it works the same way in
both languages, I would guess it might be a universal regex trait.
 
Daniel Sheppard

Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - i.e., /(?=,)/ rather than /(,)/

It is curious that it's not in the api doc... I must have learnt it from
somewhere...

Dan.
 
7stud --

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):057:0> s = 'a::b:::c::::d'
=> "a::b:::c::::d"
irb(main):058:0> s.split(/:/)
=> ["a", "", "b", "", "", "c", "", "", "", "d"] => OK
irb(main):059:0> s.split(/:+/)
=> ["a", "b", "c", "d"] => OK
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?
irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???
irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized sub groupings, the
result array will include the match for each subgroup--but not the whole
match.

Applying that rule to your examples:
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results. (A repeated group keeps only its last repetition, which
is why each run of colons contributes just one ":".)

irb(main):061:0> s.split(/((:)+)/)
=> ["a", "::", ":", "b", ":::", ":", "c", "::::", ":", "d"] => ???

The subgroup (:) matches one colon and those results are included. The
subgroup ((:)+) matches two, three, and four colons as it traverses the
string and those results are included. Because groups are numbered by
their leftmost parentheses, the outer grouping comes first in the list.

irb(main):062:0> s.split(/(:+)/)
=> ["a", "::", "b", ":::", "c", "::::", "d"] => ???

The subgroup (:+) matches two, three, and four colons as it traverses
the string, and those matches are included in the results.

And, here is an example of my own that shows that the whole match is not
included in the results--only the parenthesized sub groupings are
included:

str = 'a_::_b_:::_c_::::_d'
pattern = /_(:+)_/

results = str.split(pattern)
p results

--output:--
["a", "::", "b", ":::", "c", "::::", "d"]
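
To see why the outer grouping comes first, it may help to compare what split returns with the captures of an ordinary match. This is only a sketch: the sample string and variable names are made up for illustration, and it assumes split appends the same captures that a plain match reports.

```ruby
# Illustrative only: compare split's extra elements with a plain match's captures.
s = 'a::b:::c'

# Groups are numbered by their opening parenthesis, so the outer group is
# capture 1 and the inner group is capture 2.
m = s.match(/((:)+)/)
outer, inner = m.captures

# split places each match's captures between the surrounding fields,
# in the same order: outer group first, then inner group.
fields = s.split(/((:)+)/)

p outer   # "::"
p inner   # ":"
p fields  # ["a", "::", ":", "b", ":::", ":", "c"]
```

The inner capture is a single colon because a repeated group only remembers its last repetition.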
 
7stud --

mortee said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...


$ ri String#split
...
...
if _pattern_ is a +Regexp+, _str_ is divided where the pattern
matches. Whenever the pattern matches a zero-length string, _str_
is split into individual characters.
...



pickaxe2, p. 619 adds a line to the end of that description:

...
If pattern includes groups, these groups will be included in the
returned values.
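
Both statements, the zero-length case from ri and the extra line about groups, can be checked with a short script. A rough sketch; the sample strings are made up for illustration:

```ruby
# Zero-length matches: an empty pattern splits the string into characters.
chars = 'abc'.split(//)

# Pattern with a group: the text captured by the group is returned
# between the fields.
with_group = 'a,b'.split(/(,)/)

# Same separator without a group: only the fields come back.
without_group = 'a,b'.split(/,/)

p chars          # ["a", "b", "c"]
p with_group     # ["a", ",", "b"]
p without_group  # ["a", "b"]
```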
 
mortee

Daniel said:
Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

It is curious that it's not in the api doc... I must have learnt it from
somewhere...

7stud said:
I guess I should mention that the rule I jotted down in the margin of my
book is: if the split() pattern has parenthesized sub groupings, the
result array will include the match for each subgroup--but not the whole
match.

Applying that rule to your examples:
irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"] => ?

The subgroup (:) matches a single colon, so those matches are included
in the results,
[...]

Thanks, that clarifies it, and the results make sense based on the rule.
However, I find it quite confusing to have parts of what I intend to be
part of the "separator" among the list of results. To say the least.

mortee
 
dan-ml

Daniel said:
Is this expected behaviour? I haven't seen anything related to this
mentioned in the API docs...

irb(main):060:0> s.split(/(:)+/)
=> ["a", ":", "b", ":", "c", ":", "d"]

Yes, any capture groups in the regex will be included in the split
array. If you want to use groups without capturing it into the split
array, use a non-capturing group - ie, /(?=,)/ rather than /(,)/

/(?=,)/ is a lookahead match. I'm sure you really meant /(?:,)/
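
The difference between the three forms shows up directly in what split returns. A small sketch, using a made-up sample string:

```ruby
s = 'a,b,c'

# Capturing group: the separator is consumed and also returned.
capturing = s.split(/(,)/)

# Non-capturing group: the separator is consumed and dropped.
non_capturing = s.split(/(?:,)/)

# Lookahead: matches the empty position before each comma, so nothing
# is consumed and each comma stays attached to the following field.
lookahead = s.split(/(?=,)/)

p capturing      # ["a", ",", "b", ",", "c"]
p non_capturing  # ["a", "b", "c"]
p lookahead      # ["a", ",b", ",c"]
```

So /(?=,)/ does not add extra elements either, but it also leaves the commas in the fields, which is probably not what was intended.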
 
