Lexing and Parsing in Ruby.

Discussion in 'Ruby' started by John Carter, Nov 19, 2003.

  1. John Carter (Guest)

    Ooh this is so, so sweet...

    If you know anything about lexical analyzers and parsers, read this
    carefully and realize I have done something very very very Good
    here...

    # Here is a lexical analyzer that breaks a
    # Graphviz "Dot" file into tokens...
    # digraph a_name {
    #   a -> b;
    #   c -> d -> e;
    #   f -> {a;b;c}
    #   d;
    # };
    #
    # Yields "digraph", "a_name", "{", "a", "->", "b", ";", ...
    def tokenize(string)
      regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
      match_data = regexp.match( string)
      while match_data
        token = match_data[0]
        yield token unless token =~ %r{ ^\s }x
        match_data = regexp.match( match_data.post_match)
      end
    end


    # Now here is the trick...
    # If we map tokens to single characters,
    # then parsing becomes a matter of...
    # matching a regular expression again!
    def parse(string)
      # Build up a list of tokens in an array
      token_list = []
      # Make a one-to-one map between token types and characters
      token_string = ''

      tokenize(string) do |token|
        token_list << token
        if token == 'digraph'
          token_string += 'd'
        elsif token =~ %r{ [\{\};] }x
          token_string += token
        elsif token == '->'
          token_string += '>'
        else
          token_string += 'i'
        end
      end

      # Now our example is the string "di{i>i;i>i>i;i>{i;i;i}i;}"
      regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
      match_data = regexp.match( token_string)
      raise "Parse error: '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
      return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
    end

    # io holds the Dot source, read in as a string
    name, edge_string, edge_list = parse(io)

    # Now edge_string is i>i;i>i>i;i>{i;i;i}i;
    # name is "a_name"
    # edge_list is ["a","->","b",";","c",...]
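
    For reference, here is the whole thing as one runnable sketch (same code as above, with the catch-all branch written as `else` and the Dot source supplied as a plain string in place of `io`):

```ruby
# Lexer: pull one token at a time off the front of the string.
def tokenize(string)
  regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
  match_data = regexp.match(string)
  while match_data
    token = match_data[0]
    yield token unless token =~ %r{ ^\s }x
    match_data = regexp.match(match_data.post_match)
  end
end

# Parser: map each token to one character, then match the whole
# token string against a single regular expression.
def parse(string)
  token_list = []    # every token, in order
  token_string = ''  # one character per token
  tokenize(string) do |token|
    token_list << token
    if token == 'digraph'
      token_string << 'd'
    elsif token =~ %r{ [\{\};] }x
      token_string << token
    elsif token == '->'
      token_string << '>'
    else
      token_string << 'i'
    end
  end
  regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
  match_data = regexp.match(token_string)
  raise "Parse error: '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
  [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
end

dot = "digraph a_name {\na -> b;\nc -> d -> e;\nf -> {a;b;c}\nd;\n};"
name, edge_string, edge_list = parse(dot)
# name        => "a_name"
# edge_string => "i>i;i>i>i;i>{i;i;i}i;"
# edge_list   => ["a", "->", "b", ";", "c", "->", "d", ...]
```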


    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email :
    New Zealand

    A Million Monkeys can inflict worse things than just Shakespeare on
    your system.
    John Carter, Nov 19, 2003
    #1

  2. John Carter <> wrote in the news message
    news:p...
    > # Here is a lexical analyzer that breaks a
    > # Graphviz "Dot" file into tokens...
    > # digraph a_name {
    > #   a -> b;
    > #   c -> d -> e;
    > #   f -> {a;b;c}
    > #   d;
    > # };
    > #
    > # Yields "digraph", "a_name", "{", "a", "->", "b", ";", ...
    > def tokenize(string)


    Have you seen LexerInRuby? It's a nice example which shows how it
    can be done:
    http://www.rubygarden.org/ruby?LexerInRuby

    --
    Simon Strandgaard
    Simon Strandgaard, Nov 19, 2003
    #2

  3. "John Carter" <> wrote in the news message
    news:p...
    > Ooh this is so, so sweet...
    >
    > If you know anything about lexical analyzers and parsers, read this
    > carefully and realize I have done something very very very Good
    > here...


    Sorry, I don't see what's so very very very good about it. Will you
    enlighten me?

    > # Here is a lexical analyzer that breaks a
    > # Graphviz "Dot" file into tokens...
    > # digraph a_name {
    > #   a -> b;
    > #   c -> d -> e;
    > #   f -> {a;b;c}
    > #   d;
    > # };
    > #
    > # Yields "digraph", "a_name", "{", "a", "->", "b", ";", ...
    > def tokenize(string)
    > regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
    > match_data = regexp.match( string)
    > while match_data
    > token = match_data[0]
    > yield token unless token =~ %r{ ^\s }x
    > match_data = regexp.match( match_data.post_match)
    > end
    > end


    def tokenize(string)
      string.scan( /[a-z_][a-z_\d]* | [+-]?[.\d]+ | -> | [{};]/xi ) do |m|
        yield m
      end
    end
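
    For instance (a small sketch exercising this scan-based version; note that it silently skips anything outside the token set, whitespace included):

```ruby
def tokenize(string)
  # scan walks the string and yields each match in order;
  # whitespace never matches an alternative, so it is skipped implicitly
  string.scan( /[a-z_][a-z_\d]* | [+-]?[.\d]+ | -> | [{};]/xi ) do |m|
    yield m
  end
end

tokens = []
tokenize("a -> b; b -> c;") { |t| tokens << t }
# tokens => ["a", "->", "b", ";", "b", "->", "c", ";"]
```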

    > # Now here is the trick...
    > # If we map tokens to characters
    > # then parsing becomes a matter of ....
    > # Matching a regular Expression again!!


    I don't know the Graphviz format, but this only works as a real parser
    (i.e. one that verifies the input against a grammar) if that file format
    does not use nesting constructs of arbitrary depth, i.e., if it has a
    regular grammar (as opposed to a context-free grammar). This property of
    the language does not change if you replace tokens by shorter tokens.
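
    That limit is easy to see with the parse regexp from the original post: exactly one level of inner bracing is spelled out in the pattern itself, so a token string with deeper nesting can never match (a sketch):

```ruby
# The parse regexp from the original post: the inner \{ ... \} is the
# only level of nesting the pattern can describe.
regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x

one_level  = regexp.match("di{i>{i;i}}")      # one nested brace group: matches
two_levels = regexp.match("di{i>{i>{i;i}}}")  # two levels: nil, no recursion in the pattern
```

    (Modern Ruby regexps do offer recursive subexpression calls via \g<...>, but that is an extension beyond regular languages; the pattern as written stays regular.)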

    > def parse(string)
    > # Build up a list of tokens in an array
    > token_list = []
    > # Make a one to one map between token type and the array
    > token_string = ''
    >
    > tokenize(string) do |token|
    > token_list << token
    > if token == 'digraph'
    > token_string += 'd'
    > elsif token =~ %r{ [\{\};] }x
    > token_string += token
    > elsif token == '->'
    > token_string += '>'
    > else
    > token_string += 'i'
    > end
    > end


    Btw, note that string << 'i' is more efficient than string += 'i'.
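
    The reason: += builds a brand new String for every append, while << mutates the receiver in place. A quick sketch:

```ruby
s = "di"
before = s.object_id
s << "{"                        # appends in place: same object
same_object = (s.object_id == before)

t = "di"
before = t.object_id
t += "{"                        # sugar for t = t + "{": allocates a fresh String
new_object = (t.object_id != before)

# same_object => true, new_object => true; both s and t == "di{"
```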

    And you can combine the two steps with a slightly modified regexp, since you
    already know what you get during scanning:

    def tokenize(string)
      tk = []
      tl = ""

      string.scan( /(digraph) | ([a-z_][a-z_\d]*) | ([+-]?[.\d]+) | (->) | ([{};])/xi ) do |di, name, num, arr, noise|
        if di
          tk << di
          tl << 'd'
        elsif name
          tk << name
          tl << 'i'
        elsif num
          tk << num
          tl << 'i'
        elsif arr
          tk << arr
          tl << '>'
        elsif noise
          tk << noise
          tl << noise
        else
          # never here
        end
      end

      [tk, tl]
    end

    Now you can do

    irb(main):144:0> tokens, pattern = tokenize( "digraph a_name {
    irb(main):145:1" a -> b;
    irb(main):146:1" c -> d -> e;
    irb(main):147:1" f -> {a;b;c}
    irb(main):148:1" d;
    irb(main):149:1" };" )
    => [["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e", "f", "->", "{", "a", "b", "c", "}", "d", "}"], "di{i>ii>i>ii>{iii}i}"]
    irb(main):150:0> tokens
    => ["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e", "f", "->", "{", "a", "b", "c", "}", "d", "}"]
    irb(main):151:0> pattern
    => "di{i>ii>i>ii>{iii}i}"
    irb(main):152:0>

    Rest as before.

    Cheers

    robert


    > # Now our example is a string "di{i>i;i>i>i;i>{i;i;i}i;}"
    > regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
    > match_data = regexp.match( token_string)
    > raise "Parse error '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
    > return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
    > end
    >
    > name, edge_string, edge_list = parse(io)
    >
    > # Now edge_string is i>i;i>i>i;i>{i;i;i}i;
    > # name is "a_name"
    > # edge_list is ["a","->","b",";","c",...]
    Robert Klemme, Nov 19, 2003
    #3
