Scan for Tokens

Discussion in 'Ruby' started by Raul Parolari, Nov 11, 2007.

  1. I am looking for the best way to break an input string into individual
    tokens (I do not want to use a lexer library); I found some Ruby
    programs that do it by "nibbling" at the string, like this (for
    simplicity, the tokens are simply printed):
    str = "20 * sin(x) + ..."

    while (str.length > 0)
      if    str.sub!(/\A\s*(\d+)/) { puts "nr: #{$1}" ; '' }
      elsif str.sub!(/\A\s*(\w+)/) { puts "func: #{$1}" ; '' }
      ..

    This works, but it is very inefficient, as the string has to be
    modified continuously. (A variation is to use str.match and then set
    str = post_match, which is probably even worse.)
    I was looking for the equivalent of what Perl calls "walking the string"
    (if $str =~ /\G ../gcxms), picking up one token at a time, starting at
    the point where the previous one ended.

    I saw a mention of \G with scan in the Pickaxe, but I could not make
    scan work 'one token at a time'; I had to list all the tokens in a
    single pattern, and then find out which token had matched, i.e.:

    str.scan(/\G\s* (\d+ | [*][*] | [+] | [(] | ..)/xm) do |m|
      if    m[0].match(/\A\d+\z/)    then puts "number: #{m[0]}"
      elsif m[0].match(/\A[*][*]\z/) then puts "power: #{m[0]}"
      ..

    It worked perfectly (almost to my surprise!), but it seems funny
    (un-Ruby-like) to have to repeat the tokens (even though in my real
    code I used regexp variables to avoid hardcoding them twice, it is
    still a repetition).
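One possible way to avoid testing each token twice is to give every alternative its own named group and ask the match data which group fired. This is only a sketch, assuming Ruby 1.9 or later (named groups were not available in 1.8); the constant TOKEN, the method scan_tokens, and the group names are hypothetical and chosen for illustration.

```ruby
# One pattern anchored with \G; each token kind gets a named group.
# Group names and the operator set are illustrative assumptions.
TOKEN = /\G\s*(?:(?<number>\d+)|(?<power>\*\*)|(?<op>[-+*\/()])|(?<name>\w+))/

def scan_tokens(str)
  tokens = []
  str.scan(TOKEN) do
    md = Regexp.last_match                # MatchData for the current token
    kind = md.names.find { |n| md[n] }    # the only group that captured text
    tokens << [kind.to_sym, md[kind]]
  end
  tokens
end

scan_tokens("20 * sin(x) ** 2").each { |kind, text| puts "#{kind}: #{text}" }
```

Because of the \G anchor, scanning simply stops at the first character that no alternative recognizes, so a caller that needs error reporting would have to check how far the scan got.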

    I looked at four Ruby books and found only platitudes on the subject
    (or references to libraries). I would love to hear of an elegant way
    to solve this.

    thanks!

    Raul
    --
    Posted via http://www.ruby-forum.com/.
     
    Raul Parolari, Nov 11, 2007

  2. Phrogz

    On Nov 10, 6:07 pm, Raul Parolari wrote:
    > I am looking for the best way to break an input string into individual
    > tokens (I do not want to use a lexer library)


    Look at the StringScanner library[1] included with Ruby. It's simple,
    and it's fast. It's the basis of my TagTreeScanner library[2], which
    is specialized for parsing arbitrary text and converting it into
    hierarchically nested markup (e.g. XML).

    [1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
    [2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
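A minimal sketch of how StringScanner might be applied to the expression tokens from the original question. The table TOKEN_PATTERNS, the method tokenize, and the token kinds are hypothetical names chosen for illustration; only StringScanner itself and its methods (scan, skip, eos?, matched, pos) come from the standard library.

```ruby
require 'strscan'

# Each token kind is listed once; order matters (** must come before *).
# The kinds and patterns are illustrative assumptions.
TOKEN_PATTERNS = [
  [:number, /\d+/],
  [:power,  /\*\*/],
  [:op,     /[-+*\/()]/],
  [:name,   /\w+/],
]

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)                   # drop leading whitespace, if any
    break if scanner.eos?
    # scan only matches at the current position and advances on success
    kind, _pattern = TOKEN_PATTERNS.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless kind
    tokens << [kind, scanner.matched]
  end
  tokens
end

tokenize("20 * sin(x) ** 2").each { |kind, text| puts "#{kind}: #{text}" }
```

The scanner keeps an internal position instead of modifying the string, which is what makes this approach cheaper than the sub! loop in the question.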
     
    Phrogz, Nov 11, 2007

  3. Gavin Kistner wrote:
    > On Nov 10, 6:07 pm, Raul Parolari wrote:
    >> I am looking for the best way to break an input string into individual
    >> tokens (I do not want to use a lexer library)

    >
    > Look at the StringScanner library[1] included with Ruby. It's simple,
    > and it's fast. It's the basis of my TagTreeScanner library[2], which
    > is specialized for parsing arbitrary text and converting it into
    > hierarchically nested markup (e.g. XML).
    >
    > [1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
    > [2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html


    Gavin

    I was surprised at first that this basic capability was in a library,
    but StringScanner works beautifully, and it is indeed extremely fast.

    I will try your TagTreeScanner at the first opportunity.

    Thank you

    Raul
     
    Raul Parolari, Nov 11, 2007
