Scan for Tokens


Raul Parolari

I am looking for the best way to break an input string into individual
tokens (I do not want to use a lexer library); I found some Ruby
programs that do it by "nibbling" at the string, like this (to keep it
short, the tokens are just printed):
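str = "20 * sin(x) + ..."

while str.length > 0
  if    str.sub!(/\A\s*(\d+)/) { puts "nr: #{$1}" ; '' }
  elsif str.sub!(/\A\s*(\w+)/) { puts "func: #{$1}" ; '' }
  # ... one branch per remaining token type
  else  break   # nothing matched: stop instead of looping forever
  end
end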

This works, but it is very inefficient, as the string has to be
modified continuously (a variation is to use str.match and then set str
= post_match, which is probably even worse).
I was looking for the equivalent of what Perl calls "walking the string"
(if $str =~ /\G ../gcxms), picking up one token at a time at the point
just after the previous one was retrieved.

I saw in the Pickaxe the mention of \G with scan, but I could not make
scan work 'one token at a time'; I had to list all the tokens in a single
pattern, and then I had to find out which token had matched, i.e.:

str.scan(/\G\s* (\d+ | \*\* | [+] | [(] | ..)/xm) do |m|
  if    m[0].match(/\A\d+\z/)  then puts "number: #{m[0]}"
  elsif m[0].match(/\A\*\*\z/) then puts "power: #{m[0]}"
  ..

It worked perfectly (almost to my surprise!), but it seems odd
(un-Ruby-like) to have to repeat the tokens (even though in my real code
I used regexp variables to avoid hardcoding them twice, it is still a
repetition).

I looked at 4 Ruby books and I found only platitudes on the subject (or
references to libraries). I would love to hear an elegant way to solve
this,

thanks!

Raul
 

Phrogz

I am looking for the best way to break an input string into individual
tokens (I do not want to use a lexer library)

Look at the StringScanner library[1] included with Ruby. It's simple,
and it's fast. It's the basis of my TagTreeScanner library[2], which
is specialized for parsing arbitrary text and converting it into
hierarchically nested markup (e.g. XML).

[1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
[2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
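
For anyone reading along, here is a rough sketch of what a StringScanner-based tokenizer could look like for the kind of input in the question. The token table (TOKEN_PATTERNS), its labels, and the tokenize helper are made up for illustration; they are not taken from the original post or from TagTreeScanner. Only the strscan calls (scan, skip, eos?, matched, pos) are part of the standard library.

```ruby
require 'strscan'

# Hypothetical token table: [label, pattern] pairs, tried in order.
# Order matters: "**" must be tried before the single "*" operator,
# and numbers before identifiers (since \w also matches digits).
TOKEN_PATTERNS = [
  [:number, /\d+/],
  [:ident,  /\w+/],
  [:power,  /\*\*/],
  [:op,     /[-+*\/()]/],
].freeze

# Walks the input once, left to right, without modifying the string.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)            # drop whitespace before the next token
    break if scanner.eos?
    # scan only matches at the current position and advances past the match
    label, _pattern = TOKEN_PATTERNS.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless label
    tokens << [label, scanner.matched]
  end
  tokens
end

tokenize("20 * sin(x) + 3 ** 2").each { |label, text| puts "#{label}: #{text}" }
```

Each pattern is listed once, so there is no second pass to work out which token matched, and the scanner keeps its own position instead of chopping the string.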
 

Raul Parolari

Gavin said:
I am looking for the best way to break an input string into individual
tokens (I do not want to use a lexer library)

Look at the StringScanner library[1] included with Ruby. It's simple,
and it's fast. It's the basis of my TagTreeScanner library[2], which
is specialized for parsing arbitrary text and converting it into
hierarchically nested markup (e.g. XML).

[1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
[2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html

Gavin

I was surprised at first that this basic capability was in a library,
but StringScanner works beautifully, and it is indeed extremely fast.

I will try your TagTreeScanner at the first chance.

Thank you

Raul
 
