Scan for Tokens

Discussion in 'Ruby' started by Raul Parolari, Nov 11, 2007.

  1. I am looking for the best way to break an input string into individual
    tokens (I do not want to use a lexer library); I found some Ruby
    programs that do it by "nibbling" at the string, like this (for
    simplicity, the tokens are simply printed):
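    str = "20 * sin(x) + ..."

    while str.length > 0
      # Each sub! removes the matched token from the front of str.
      if    str.sub!(/\A\s*(\d+)/) { |m| puts "nr: #{m}" ; '' }
      elsif str.sub!(/\A\s*(\w+)/) { |m| puts "func: #{m}" ; '' }
      else  break   # patterns for the remaining tokens are omitted here
      end
    end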

    This works, but it is very inefficient, since the string has to be
    modified on every iteration (a variation is to use str.match and then
    set str = post_match, which is probably even worse).
    I was looking for the equivalent of what Perl calls "walking the string"
    (if $str =~ /\G ../gcxms), picking up one token at a time, starting at
    the point where the previous match ended.

    I saw in the Pickaxe the mention of \G with scan, but I could not make
    scan work one token at a time; I had to list all the tokens in a single
    pattern, and then I had to find out which token had matched, i.e.:

    str.scan(/\G\s* (\d+ | [*][*] | [+] | [(] | ..)/xm) do |m|
      if m[0].match(/\A\d+\z/) then puts "number: #{m}"
      elsif m[0].match(/\A[*][*]\z/) then puts "power: #{m}"
      ..

    It worked perfectly (almost to my surprise!), but it seems funny
    (un-Ruby-like) to have to repeat the tokens (even if in my real code I
    used regexp vars to avoid hardcoding them twice, it is still a
    repetition).
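
    One way to sketch "walking the string" without naming each token twice
    is to keep the patterns in a table and try them in order at the current
    offset. This is only an illustration, not code from the thread: the
    TOKENS table, the token names and the tokenize method are hypothetical,
    and it assumes Ruby 1.9 or later, where String#match accepts a start
    position and \G anchors the match at that position.

```ruby
# Hypothetical token table: each pattern is written exactly once.
# Order matters: :power must be tried before :op so "**" is not split.
TOKENS = {
  number: /\G\s*(\d+)/,
  power:  /\G\s*(\*\*)/,
  op:     /\G\s*([-+*\/()])/,
  ident:  /\G\s*(\w+)/
}

# Walks the string from left to right without modifying it.
# Returns an array of [token_name, text] pairs.
def tokenize(str)
  pos = 0
  tokens = []
  while pos < str.length
    name = nil
    m = nil
    TOKENS.each do |candidate, re|
      m = str.match(re, pos)      # \G anchors the match at pos
      if m
        name = candidate
        break
      end
    end
    break unless m                # trailing whitespace or unknown input
    tokens << [name, m[1]]
    pos = m.end(0)                # continue right after this token
  end
  tokens
end

tokenize("20 * sin(x)").each { |name, text| puts "#{name}: #{text}" }
```

    Because the string is never rewritten, each step only advances an
    integer offset; the name of the token that matched comes from the table
    key, so no second round of matching is needed.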

    I looked at 4 Ruby books and I found only platitudes on the subject (or
    references to libraries). I would love to hear an elegant way to solve
    this.

    thanks!

    Raul
    --
    Posted via http://www.ruby-forum.com/.
     
    Raul Parolari, Nov 11, 2007
    #1

  2. Phrogz

    On Nov 10, 6:07 pm, Raul Parolari <> wrote:
    > I am looking for the best way to break an input string into individual
    > tokens (I do not want to use a lexer library)


    Look at the StringScanner library[1] included with Ruby. It's simple,
    and it's fast. It's the basis of my TagTreeScanner library[2], which
    is specialized for parsing arbitrary text and converting it into
    hierarchically nested markup (e.g. XML).

    [1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
    [2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
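
    As a rough illustration of the StringScanner approach, a tokenizer for
    the arithmetic example from the question might look like the sketch
    below. The scan_tokens method, the token names and the exact patterns
    are assumptions made for this example; only StringScanner itself (from
    the strscan standard library) and its scan, skip, eos?, matched, pos and
    peek methods are taken from the library.

```ruby
require 'strscan'

# Hypothetical helper: returns an array of [token_name, text] pairs.
def scan_tokens(input)
  ss = StringScanner.new(input)
  tokens = []
  until ss.eos?
    ss.skip(/\s+/)                # drop whitespace between tokens
    break if ss.eos?
    # scan only matches at the current scan pointer and advances it on success
    if    ss.scan(/\d+/)       then tokens << [:number, ss.matched]
    elsif ss.scan(/\*\*/)      then tokens << [:power,  ss.matched]
    elsif ss.scan(/[-+*\/()]/) then tokens << [:op,     ss.matched]
    elsif ss.scan(/\w+/)       then tokens << [:ident,  ss.matched]
    else
      raise ArgumentError,
            "unexpected character #{ss.peek(1).inspect} at offset #{ss.pos}"
    end
  end
  tokens
end

scan_tokens("20 * sin(x)").each { |name, text| puts "#{name}: #{text}" }
```

    The scanner keeps its own position in the string, so the input is never
    modified, and each pattern is written once, in the branch that handles
    it.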
     
    Phrogz, Nov 11, 2007
    #2

  3. Gavin Kistner wrote:
    > On Nov 10, 6:07 pm, Raul Parolari <> wrote:
    >> I am looking for the best way to break an input string into individual
    >> tokens (I do not want to use a lexer library)

    >
    > Look at the StringScanner library[1] included with Ruby. It's simple,
    > and it's fast. It's the basis of my TagTreeScanner library[2], which
    > is specialized for parsing arbitrary text and converting it into
    > hierarchically nested markup (e.g. XML).
    >
    > [1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
    > [2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html


    Gavin

    I was surprised at first that this basic capability was in a library,
    but StringScanner works beautifully, and it is indeed extremely fast.

    I will try your TagTreeScanner at the first chance.

    Thank you

    Raul
     
    Raul Parolari, Nov 11, 2007
    #3
