String#scan strangeness


Gennady

Hi there,

[linux.gfbs:281]gfb> ruby -v
ruby 1.6.8 (2003-10-15) [i686-linux]
[linux.gfbs:282]gfb> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
=> [["a b ", "b "], ["c d ", "d "]]
irb(main):003:0>

I am just wondering why String#scan "loses" a group in every match. I
would expect the following result:

=> [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

or even

=> [["a b ", ["a ", "b "]], ["c d ", ["c ", "d "]]]

Where am I wrong in my expectations?

Thank you,
Gennady.

P.S.
It works the same way in Ruby 1.8.0 as well.
 

Simon Strandgaard

When the pattern contains sub-captures, #scan returns an array of the
sub-captures for each match. This does not include capture[0], which is
the full match.

"abcd".scan(/(.)(.)/)
#=> [["a", "b"], ["c", "d"]]

When the pattern contains no sub-captures at all, #scan returns only the full matches.

"abcd".scan(/../)
#=> ["ab", "cd"]
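
The two cases described above can be checked side by side. This is a minimal sketch, not part of the original posts; the variable names are only illustrative. The last call shows one way to get the full match back: wrap the whole pattern in an extra pair of parentheses so that it becomes group 1.

```ruby
text = "abcd"

# With capture groups: one inner array per match, holding groups 1..n only.
with_groups = text.scan(/(.)(.)/)

# Without capture groups: one string per match (the full match).
without_groups = text.scan(/../)

# An extra outer pair of parentheses makes the full match the first group.
full_and_groups = text.scan(/((.)(.))/)

p with_groups      # [["a", "b"], ["c", "d"]]
p without_groups   # ["ab", "cd"]
p full_and_groups  # [["ab", "a", "b"], ["cd", "c", "d"]]
```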
 

Gennady

The pattern in my original irb session does have sub-captures; moreover,
they are nested:

[linux.gfbs:281]gfb> ruby -v
ruby 1.6.8 (2003-10-15) [i686-linux]
[linux.gfbs:282]gfb> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]

irb(main):003:0>
 

Simon Strandgaard

irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{((\S+\s+){2,2})}
ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]
                      ^^^^                   ^^^^
These are not sub-captures and are thus not being captured.
You need parentheses in order to capture them.
 

David A. Black

Hi --

My understanding is: you've only got two sets of parentheses, so you
can have at most two captures; in other words, (){2} != ()() :) It's
purely positional: whatever is in the nth set of parentheses from the
left when the matching stops is the nth capture.

It's as if each () is a window which can move through the string but
can only hold one substring. So the second set of () sort of moves
from left to right:

(("a ")....)
("a "("b ")) # match completed

Result: $1 == "a b "
$2 == "b "


David
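
The "moving window" picture above can be observed directly with Regexp#match. This is a small sketch added for illustration, not taken from the original posts; `pattern` and `m` are just local names. It uses `{2}`, which is equivalent to the `{2,2}` in the original pattern.

```ruby
text = "a b c d "
pattern = /((\S+\s+){2})/

m = pattern.match(text)

# Two pairs of parentheses in the pattern, so exactly two captures,
# no matter how many times the inner group repeats.
p m.captures.size  # 2

p m[0]  # "a b "  (full match)
p m[1]  # "a b "  (outer group)
p m[2]  # "b "    (inner group keeps only its last repetition)
```

The first repetition of the inner group ("a ") is matched but overwritten, which is why it never shows up in the result of #scan.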
 

Gennady

^^^^^^^
^^^^^^^^^^^^^^
These are sub-captures

ACTUAL => [["a b ", "b "], ["c d ", "d "]]
I EXPECT => [["a b ", "a ", "b "], ["c d ", "c ", "d "]]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
And this is scan's result presented by irb
 

Gennady

Thanks, David. It looks like this is the case. Actually, I solved my
problem by using the following regexp instead:

[linux.gfbs:71]gfb-ems-session_1> irb
irb(main):001:0> a = "a b c d "
=> "a b c d "
irb(main):002:0> a.scan %r{#{'(\S+\s+)' * 2}}
=> [["a ", "b "], ["c ", "d "]]
irb(main):003:0>

(My actual regexp is much bigger, I just used a simplified form for an
example)

Gennady.
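
For reference, here is a sketch of how the results expected in the first post could be produced, building on the workaround above. It is an illustration only: the names `word`, `flat`, `with_pair` and `nested` are hypothetical, and the two-pass version with a non-capturing group `(?:...)` is one possible approach rather than anything proposed in the thread.

```ruby
text = "a b c d "
word = '(\S+\s+)'   # single quotes keep the backslashes literal

# The workaround from the post: repeat the group text so that each
# repetition gets its own pair of parentheses.
flat = text.scan(/#{word * 2}/)

# Same idea with an extra outer group, so the whole pair is captured too.
with_pair = text.scan(/(#{word * 2})/)

# Nested form: match pairs without capturing, then scan each pair again.
nested = text.scan(/(?:\S+\s+){2}/).map { |pair| [pair, pair.scan(/\S+\s+/)] }

p flat       # [["a ", "b "], ["c ", "d "]]
p with_pair  # [["a b ", "a ", "b "], ["c d ", "c ", "d "]]
p nested     # [["a b ", ["a ", "b "]], ["c d ", ["c ", "d "]]]
```

The second pass in `nested` works for any repetition count, since it does not depend on the number of parentheses in the outer pattern.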
 
