Possible bug with StringScanner class

John Halderman · Jul 22, 2005

I'm not sure if this is a bug or intentional behavior, so I thought I
would post it here to see what the community thinks. If you set up a
StringScanner object to perform iterative matching on a string, \A and
^ seem to always match. It seems to me that \A should only match if it
is the first match performed, and ^ should only match if bol? returns
true, which should be after a \n or if it is the first match performed.
Here is some code I ran in irb to illustrate the problem:

require 'strscan'
sc = StringScanner.new("the white elephant eats grass")
sc.scan(/the\s+/)
sc.bol?
sc.scan(/^white\s+/)
sc.scan(/\Aelephant\s+/)

This code produced the following result:

irb(main):001:0> require 'strscan'
=> true
irb(main):002:0> sc = StringScanner.new("the white elephant eats grass")
=> #<StringScanner 0/29 @ "the w...">
irb(main):003:0> sc.scan(/the\s+/)
=> "the "
irb(main):004:0> sc.bol?
=> false
irb(main):005:0> sc.scan(/^white\s+/)
=> "white "
irb(main):006:0> sc.scan(/\Aelephant\s+/)
=> "elephant "

Any thoughts or advice on this matter are greatly appreciated.

-John Halderman

Eric Mahurin · Jul 22, 2005

--- John Halderman said:
I'm not sure if this is a bug or intentional behavior, so I thought I
would post it here to see what the community thought of what was
happening. If you set up a StringScanner object to perform iterative
matching on a string, \A and ^ seem to always match. It seems to me
that \A should only match if it is the first match performed, and ^
should only match if bol? returns true, which should be after a \n or
if it is the first match performed.

You should think of the current position as the beginning of
the string for matching. In addition, the regexp that scan
gets is implicitly anchored to that spot. So specifying \A or
^ at the beginning of a regexp for scan is redundant.
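
A minimal sketch of that point, reusing the string from the original post. It uses check, which matches like scan but leaves the pointer where it is, so the three patterns can be compared at the same position. The variable names are only for illustration, and the results reflect the default StringScanner behavior reported in this thread.

```ruby
require 'strscan'

sc = StringScanner.new("the white elephant eats grass")
sc.scan(/the\s+/)                  # consume "the "; the pointer is now at 4

# check matches like scan but does not advance the pointer,
# so all three patterns are tried at the same position.
plain    = sc.check(/white\s+/)
with_bol = sc.check(/^white\s+/)
with_bos = sc.check(/\Awhite\s+/)

# A pattern that only matches further along fails, because scan and
# check are anchored at the pointer.
further  = sc.check(/elephant/)

p [plain, with_bol, with_bos]      # => ["white ", "white ", "white "]
p further                          # => nil
```

With the default settings the leading ^ or \A changes nothing: each pattern either matches at the pointer or not at all.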

John Halderman · Jul 22, 2005

--- Eric Mahurin said:
You should think of the current position as the beginning of
the string for matching. In addition, the regexp that scan
gets is implicitly anchored to that spot. So specifying \A or
^ at the beginning of a regexp for scan is redundant.

There is nothing in the documentation to suggest that the current
position should be considered the beginning of a string for matching
purposes, only that any match must start at that position. That would
mean a regexp beginning with ^ would need the current position to be
preceded by \n, or to be at the beginning of the string, in order to
match. Furthermore, the existence of bol? suggests that the current
position is not to be considered the beginning of the line. As for
whether it should be considered the beginning of the string, that
remains ambiguous, although I believe it makes more sense for it not
to be so.

Eric Mahurin · Jul 22, 2005

--- John Halderman said:
There is nothing in the documentation to suggest that the current
position should be considered the beginning of a string for matching
purposes, only that any match must start at that position. That would
mean a regexp beginning with ^ would need the current position to be
preceded by \n, or to be at the beginning of the string, in order to
match. Furthermore, the existence of bol? suggests that the current
position is not to be considered the beginning of the line. As for
whether it should be considered the beginning of the string, that
remains ambiguous, although I believe it makes more sense for it not
to be so.

I think it makes perfect sense. scan, scan_until, etc. can only
look at what is after the current position. They have no
visibility into what is before the current position. So, you
should consider it to be the beginning of the string for
matching purposes. Whether you like it or not, that is the way
it works and I think it is intentional.

John Halderman · Jul 22, 2005

--- Eric Mahurin said:
I think it makes perfect sense. scan, scan_until, etc. can only
look at what is after the current position. They have no
visibility into what is before the current position. So, you
should consider it to be the beginning of the string for
matching purposes. Whether you like it or not, that is the way
it works and I think it is intentional.

I understand the way it works; I am not debating how it works. However,
I do not think this is intentional behavior. From the documentation
provided via ri:

Scanning a string means remembering the position of a _scan
pointer_, which is just an index. The point of scanning is to move
forward a bit at a time, so matches are sought after the scan
pointer; usually immediately after it.

Given the string "test string", here are the pertinent scan pointer
positions:

  t e s t   s t r i n g
0 1 2 ...             1
                      0

When you #scan for a pattern (a regular expression), the match must
occur at the character after the scan pointer. If you use
#scan_until, then the match can occur anywhere after the scan
pointer. In both cases, the scan pointer moves _just beyond_ the
last character of the match, ready to scan again from the next
character onwards. This is demonstrated by the example above.

This says nothing about what scan has available to it when it matches,
only where the match must occur. When you match a ^, the match always
happens at the first character after a \n or at the beginning of a
string. Therefore the position of the match would still be valid for
the purposes of scan even though the \n was before the current scan
position. This can be demonstrated with the following code:

r = /^abc/
s = "efg\nabc"
m = r.match(s)
s[m.begin(0)..m.end(0)]

which produces the following output:

irb(main):001:0> r = /^abc/
=> /^abc/
irb(main):002:0> s = "efg\nabc"
=> "efg\nabc"
irb(main):003:0> m = r.match(s)
=> #<MatchData:0xb7eaf45c>
irb(main):004:0> s[m.begin(0)..m.end(0)]
=> "abc"

As you can see, the \n is not included in the match but is required for
the match to occur. Therefore I believe it only makes sense that scan
should use bol? to determine whether a regexp beginning with ^ matches,
rather than always matching it. That is why it seems to me that this is
an oversight in the implementation of StringScanner.
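
For readers on a newer Ruby: later releases of the strscan library accept a fixed_anchor: keyword in StringScanner.new, which makes \A and ^ refer to the real start of the string and of each line instead of the scan pointer. That option did not exist in the library discussed in this thread, so the sketch below assumes a Ruby whose strscan supports it (2.7 or later, as far as I know). The sample text is made up for illustration.

```ruby
require 'strscan'

text = "the white\nelephant eats grass"

# Assumption: this strscan supports the fixed_anchor: keyword.
sc = StringScanner.new(text, fixed_anchor: true)

sc.scan(/the\s+/)                     # consume "the "
mid_line     = sc.scan(/^white/)      # pointer is mid-line, so ^ should fail
sc.scan(/white\n/)                    # consume up to and including the newline
line_start   = sc.scan(/^elephant/)   # pointer follows "\n", so ^ should match
string_start = sc.scan(/\A\s+eats/)   # pointer is not at position 0

p mid_line       # => nil
p line_start     # => "elephant"
p string_start   # => nil
```

This is roughly the behavior argued for above: scan still has to match at the pointer, but the anchors are evaluated against the whole string.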

-John Halderman

Eric Mahurin · Jul 22, 2005

--- John Halderman said:
I understand the way it works; I am not debating how it works. However,
I do not think this is intentional behavior. From the documentation
provided via ri:

Scanning a string means remembering the position of a _scan
pointer_, which is just an index. The point of scanning is to move
forward a bit at a time, so matches are sought after the scan
pointer; usually immediately after it.

I think quoting the documentation only hurt your argument. It
says "matches are sought after the scan pointer" right there.
I believe most of the methods do exactly that, but there are
some exceptions that look/go backwards. You mentioned one:
bol? - looks back one character to see if it is a newline or at
pos=0. I think the reason they put this method in is for the
purpose you are wanting. What's wrong with using:

scanner.bol? and scanner.scan(/.../)

instead of trying to get this to do what you want:

scanner.scan(/^.../)
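
A small sketch of that suggestion, assuming the caller controls the pattern and can simply leave the ^ out. The sample text is made up for illustration, and && is used in place of "and" so the result can be assigned without extra parentheses.

```ruby
require 'strscan'

scanner = StringScanner.new("the white\nelephant")
scanner.scan(/the\s+/)                           # pointer is now mid-line

# bol? is false here, so the scan on the right is never attempted.
mid_line = scanner.bol? && scanner.scan(/white/)

scanner.scan(/white\n/)                          # move past the newline

# bol? is true here, so the scan runs and returns its match.
line_start = scanner.bol? && scanner.scan(/elephant/)

p mid_line     # => false
p line_start   # => "elephant"
```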

BTW, if you want to try a more general
iterator(external)/cursor/stream/scanner, try my cursor
package:

http://rubyforge.org/projects/cursor/

I have some regexp stuff in there that acts like StringScanner,
but it was an afterthought and I will probably redo its
interface.

John Halderman · Jul 22, 2005

--- Eric Mahurin said:
I think quoting the documentation only hurt your argument. It
says "matches are sought after the scan pointer" right there.
I believe most of the methods do exactly that, but there are
some exceptions that look/go backwards. You mentioned one:
bol? - looks back one character to see if it is a newline or at
pos=0. I think the reason they put this method in is for the
purpose you are wanting. What's wrong with using:

scanner.bol? and scanner.scan(/.../)

instead of trying to get this to do what you want:

scanner.scan(/^.../)

BTW, if you want to try a more general
iterator(external)/cursor/stream/scanner, try my cursor
package:

http://rubyforge.org/projects/cursor/

I have some regexp stuff in there that acts like StringScanner,
but it was an afterthought and I will probably redo its
interface.

Like I was saying though, ^ doesn't match at the newline; it only
requires that a newline exist before the character it is considering
for matching. Technically, since StringScanner simply uses a pointer to
store the current match location, the entire string is still available
to perform tests on. Implementing this would need only one character of
lookback, which wouldn't be a huge deal from what I can tell,
especially given that bol? is already implemented.
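
To make the one-character lookback concrete, here is a rough sketch. at_line_start? is a hypothetical helper written for this example, not part of strscan. It assumes the scanner exposes its string and a byte-based pos, which is how I understand the current library to work.

```ruby
require 'strscan'

# Hypothetical helper: true when the scan pointer is at position 0 or
# directly after a newline byte, using a single byte of lookback.
def at_line_start?(scanner)
  pos = scanner.pos
  pos.zero? || scanner.string.getbyte(pos - 1) == 10   # 10 is "\n"
end

sc = StringScanner.new("efg\nabc")
start_of_string = at_line_start?(sc)   # pointer at 0
sc.scan(/efg/)
mid_line = at_line_start?(sc)          # pointer before the "\n"
sc.scan(/\n/)
after_newline = at_line_start?(sc)     # pointer just past the "\n"

p [start_of_string, mid_line, after_newline]   # => [true, false, true]
```

The helper only shows how cheap the check is; it does not change how scan treats ^.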

The reason I can't use bol? and a regexp is that I am not writing the
regexp and I will not know what to expect. It isn't even a matter of
what I am implementing, but a question of whether StringScanner is
implemented correctly and the documentation is incorrect, or the
documentation is correct and StringScanner is implemented incorrectly.
I believe it to be the latter, because that would provide more useful
functionality and a more correct interpretation of regular expressions.
For my own purposes I will have to implement my own StringScanner-type
class that meets my requirements, but I would like to see the
discrepancies between the documentation and the StringScanner class
resolved.
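
One possible shape for such a class, as a sketch only. FullStringScanner and its methods are hypothetical names invented here. It relies on Regexp#match accepting a start position and on \G matching at that start position, so the pattern is pinned to the pointer while ^, \A and lookbehind are still evaluated against the whole string.

```ruby
# Hypothetical class for illustration; not part of strscan.
class FullStringScanner
  attr_reader :pos

  def initialize(string)
    @string = string
    @pos = 0              # character index of the scan pointer
  end

  # Match pattern exactly at the scan pointer. Interpolating the Regexp
  # embeds it as a group with its own options preserved; \G pins the
  # match to the position passed to Regexp#match.
  def scan(pattern)
    m = /\G#{pattern}/.match(@string, @pos)
    return nil unless m
    @pos = m.end(0)
    m[0]
  end

  # True at position 0 or directly after a newline.
  def bol?
    @pos.zero? || @string[@pos - 1] == "\n"
  end
end

fs = FullStringScanner.new("the white\nelephant eats grass")
results = [
  fs.scan(/the\s+/),       # matches at the start of the string
  fs.scan(/^white/),       # mid-line, so ^ fails
  fs.scan(/white\n/),      # consume through the newline
  fs.scan(/^elephant/),    # after the newline, so ^ matches
  fs.scan(/\A\s+eats/)     # not at the string start, so \A fails
]
p results   # => ["the ", nil, "white\n", "elephant", nil]
```

The trade-off is that every call builds a new Regexp, which is fine for a sketch but would be worth caching in real use.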

Thanks for your input.
-j
