[Q] difference between StringScanner#scan and Regexp#match

makoto kuwata · Feb 23, 2008

Hi,

I'm planning to implement StringScanner in pure Ruby.
But I found that it is hard to implement StringScanner#scan()
in pure Ruby, because of the difference between Regexp#match()
and StringScanner#scan().

StringScanner#scan() matches only when the pattern matches at the
current scan position (initially the beginning) of the input string.

require 'strscan'
input = 'foo 123'
scanner = StringScanner.new(input)
p scanner.scan(/\d+/) #=> nil

But Regexp#match() matches whenever the input string contains the
pattern anywhere.

input = 'foo 123'
m = /\d+/.match(input)
p m[0] if m #=> "123"

Is it possible to restrict Regexp#match() so that it matches only when
the pattern matches at the beginning of the input string?
My idea is to convert /regexp/ into /\A(?:regexp)/ every time,
but that is a little ugly.
Is there a better way to emulate StringScanner#scan in pure Ruby?
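
A minimal sketch of the rewriting idea above, assuming a hypothetical
helper named anchor_at_start (it is not part of Ruby or strscan).
Interpolating a Regexp into a regexp literal embeds it as a
(?-mix:...) group, so the original pattern's options are kept and no
explicit (?:...) wrapper is needed.

```ruby
# Hypothetical helper: returns a new Regexp that can only match
# at the very beginning of the string.
def anchor_at_start(pattern)
  # Regexp#to_s yields "(?-mix:...)", which keeps the pattern's own
  # options and isolates any alternation inside it.
  /\A#{pattern}/
end

anchored = anchor_at_start(/\d+/)
p anchored.match('foo 123')   #=> nil
m = anchored.match('123 foo')
p m[0] if m                   #=> "123"
```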

Michael Fellinger · Feb 24, 2008

input = 'foo 123'
if (input =~ /\d+/) == 0
  p $& # doesn't happen
end

makoto kuwata · Feb 24, 2008

Michael Fellinger said:
input = 'foo 123'
if (input =~ /\d+/) == 0
  p $& # doesn't happen
end

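Thank you, Michael, but I think this is slow and inefficient,
especially when the input string is long: if the pattern does not match
at the beginning, =~ still goes on searching the rest of the string.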

Michael Fellinger · Feb 24, 2008

You are right, of course, but I don't know any other way. =~ is about
as fast as you can get without modifying the regular expression.

Caleb Clausen · Feb 25, 2008

I've done this kind of thing before, and rewriting the regex was the
best I could come up with. If Michael's suggestion is too slow for
you, then regex rewriting is the only game in town. (Actually, if you
can determine that your regexes always require some leading substring,
you might be able to optimize Michael's way a bit more...)

If speed is an issue, why not just use the existing StringScanner?
Creating regexes at runtime can cost you quite a bit in performance as
well... some caching can help here, if the same regexes are likely to
be encountered again.
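
To illustrate the caching suggestion, here is a rough sketch of a
scanner that rewrites each pattern once and reuses the result. The
class TinyScanner and its methods are hypothetical and mimic only a
small part of StringScanner; in particular, pos counts characters
here, whereas the real StringScanner counts bytes. Because the rest of
the string is sliced off before matching, anchors such as ^ and
lookbehinds see only the remaining text.

```ruby
# Hypothetical pure-Ruby scanner, for illustration only.
class TinyScanner
  attr_reader :pos

  def initialize(string)
    @string = string
    @pos = 0
    @anchored = {}  # cache: original Regexp => anchored Regexp
  end

  # Returns the matched text and advances the position, or returns nil
  # if the pattern does not match exactly at the current position.
  def scan(pattern)
    anchored = (@anchored[pattern] ||= /\A#{pattern}/)
    m = anchored.match(@string[@pos..-1])
    return nil unless m
    @pos += m[0].length
    m[0]
  end
end

s = TinyScanner.new('foo 123')
p s.scan(/\d+/)  #=> nil
p s.scan(/\w+/)  #=> "foo"
p s.scan(/\s+/)  #=> " "
p s.scan(/\d+/)  #=> "123"
```

Regexp objects with the same source and options hash alike, so a
pattern literal used repeatedly in a loop hits the cache after its
first use.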

You might want to take a look at String#index (and its 2nd parameter)
rather than String#match or Regexp#match, as it allows you to start
matching wherever you want in the string, rather than just the
beginning. That doesn't help with your immediate question, but maybe
it'll give you some ideas of different ways to approach it.
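
As a small illustration of the offset idea: String#index returns the
index where the match begins, so comparing it with the starting offset
tells you whether the pattern matched exactly there. The second half
of the sketch is an assumption about the Ruby version in use: where
Regexp#match accepts a start position, \G anchors the pattern to that
position, so nothing needs to be sliced or rewritten with \A.

```ruby
input = 'foo 123'

# String#index searches forward from the given offset and returns the
# index at which the match starts.
p input.index(/\d+/, 4)        #=> 4
p input.index(/\d+/, 0) == 0   #=> false (the match starts at 4)

# Assumption: a Ruby version where Regexp#match takes a start position.
# There, \G only matches at the position the search started from.
p(/\G\d+/.match(input, 0))     #=> nil
m = /\G\d+/.match(input, 4)
p m[0] if m                    #=> "123"
```

Note that the plain String#index variant still searches ahead when
the pattern does not match at the offset, so it shares the cost of
the =~ approach on long inputs.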

Finally, a moment of self-promotion. My library 'sequence' implements
basically what you want (using regex rewriting, which in the full
elaboration gets rather involved). Maybe it could save you some
effort...

Michael Fellinger · Feb 25, 2008

Caleb Clausen said:
If speed is an issue, why not just use the existing StringScanner?

I, for one, started to work on a StringScanner replacement as well,
just for fun. But it could be useful for Rubinius to have a pure Ruby
implementation that can be augmented with C in some core areas by a
simple require.

Caleb Clausen said:
Finally, a moment of self-promotion. My library 'sequence' implements
basically what you want (using regex rewriting, which in the full
elaboration gets rather involved). Maybe it could save you some
effort...

Thanks for the hint: http://sequence.rubyforge.org for anyone who is
too lazy to type it

^ manveru
