Lexing and Parsing in Ruby.


John Carter

Ooh, this is so, so sweet...

If you know anything about lexical analyzers and parsers, read this
carefully and realize I have done something very very very Good
here...

# Here is a lexical analyzer that breaks a
# Graphviz "Dot" file into tokens...
#
#   digraph a_name {
#     a -> b;
#     c -> d -> e;
#     f -> {a;b;c}
#     d;
#   };
#
# Yields "digraph", "a_name", "{", "a", "->", "b", ";", ...
def tokenize(string)
  regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
  match_data = regexp.match(string)
  while match_data
    token = match_data[0]
    yield token unless token =~ %r{ ^\s }x   # skip whitespace tokens
    match_data = regexp.match(match_data.post_match)
  end
end
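For example, running it over the sample from the comment above (the heredoc here is just that sample inlined so the snippet is self-contained):

sample = <<-DOT
digraph a_name {
  a -> b;
  c -> d -> e;
  f -> {a;b;c}
  d;
};
DOT

tokens = []
tokenize(sample) { |t| tokens << t }
p tokens
# => ["digraph", "a_name", "{", "a", "->", "b", ";", "c", "->", "d", "->", "e", ";",
#     "f", "->", "{", "a", ";", "b", ";", "c", "}", "d", ";", "}", ";"]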


# Now here is the trick...
# If we map each token to a single character,
# then parsing becomes a matter of...
# matching a regular expression again!!
def parse(string)
  # Build up a list of tokens in an array
  token_list = []
  # Make a one-to-one map between token type and the array
  token_string = ''

  tokenize(string) do |token|
    token_list << token
    if token == 'digraph'
      token_string += 'd'
    elsif token =~ %r{ [\{\};] }x
      token_string += token
    elsif token == '->'
      token_string += '>'
    else
      token_string += 'i'
    end
  end

  # Now our example is the string "di{i>i;i>i>i;i>{i;i;i}i;};"
  regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
  match_data = regexp.match(token_string)
  raise "Parse error: '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
  return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
end

name, edge_string, edge_list = parse(io.read)   # io: an open file handle on the Dot source

# Now edge_string is i>i;i>i>i;i>{i;i;i}i;
# name is "a_name"
# edge_list is ["a","->","b",";","c",...]
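Putting it all together on the `sample` heredoc from the earlier snippet (my own worked example, not part of the original post):

name, edge_string, edge_list = parse(sample)

puts name          # => a_name
puts edge_string   # => i>i;i>i>i;i>{i;i;i}i;
p edge_list        # => ["a", "->", "b", ";", "c", "->", "d", "->", "e", ";",
                   #     "f", "->", "{", "a", ";", "b", ";", "c", "}", "d", ";"]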


John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : (e-mail address removed)
New Zealand

A Million Monkeys can inflict worse things than just Shakespeare on
your system.
 

Simon Strandgaard

John Carter said:
# Here is a lexical analyzer that breaks a
# Graphviz "Dot" file into tokens...
#
#   digraph a_name {
#     a -> b;
#     c -> d -> e;
#     f -> {a;b;c}
#     d;
#   };
#
# Yields "digraph", "a_name", "{", "a", "->", "b", ";", ...
def tokenize(string)

Have you seen LexerInRuby? It's a nice example which shows how it
can be done:
http://www.rubygarden.org/ruby?LexerInRuby
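Not the code from that page, but the table-driven idea it illustrates looks roughly like this sketch (the `RULES` table and `lex` method are just names made up for illustration; it uses StringScanner from the standard library):

require 'strscan'

# Each rule pairs a regexp with a token type; nil means "skip the match".
RULES = [
  [/\s+/,                     nil],
  [/digraph\b/,               :digraph],
  [/[a-zA-Z_][a-zA-Z0-9_]*/,  :ident],
  [/[-+]?[0-9.]+/,            :number],
  [/->/,                      :arrow],
  [/[{};]/,                   :punct]
]

def lex(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    rule = RULES.find { |re, _type| scanner.scan(re) }
    raise "Lexing error near #{scanner.rest[0, 20].inspect}" unless rule
    type = rule[1]
    tokens << [type, scanner.matched] if type
  end
  tokens
end

p lex("a -> b;")
# => [[:ident, "a"], [:arrow, "->"], [:ident, "b"], [:punct, ";"]]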
 

Robert Klemme

John Carter said:
Ooh, this is so, so sweet...

If you know anything about lexical analyzers and parsers, read this
carefully and realize I have done something very very very Good
here...

Sorry, I don't see what's so very very very good about it. Will you
enlighten me?
# Here is a lexical analyzer that breaks a
# Graphviz "Dot" file into tokens...
#
#   digraph a_name {
#     a -> b;
#     c -> d -> e;
#     f -> {a;b;c}
#     d;
#   };
#
# Yields "digraph", "a_name", "{", "a", "->", "b", ";", ...
def tokenize(string)
  regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
  match_data = regexp.match(string)
  while match_data
    token = match_data[0]
    yield token unless token =~ %r{ ^\s }x
    match_data = regexp.match(match_data.post_match)
  end
end

def tokenize(string)
  string.scan( /[a-z_][a-z_\d]* | [+-]?[.\d]+ | -> | [{};]/xi ) do |m|
    yield m
  end
end
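For instance (a quick check of my own, not from the thread):

tokenize("a -> b; c -> d;") { |t| print t, ' ' }
# prints: a -> b ; c -> d ;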
# Now here is the trick...
# If we map each token to a single character,
# then parsing becomes a matter of...
# matching a regular expression again!!

I don't know the Graphviz format, but this only works as a real parser
(i.e. one that verifies the input against a grammar) if that file format
does not use nesting constructs of arbitrary depth, i.e. if it has a
regular grammar (as opposed to a context-free grammar). That property of
the language does not change when you replace tokens with shorter tokens.
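To make that concrete (my own illustration, using John's tokenize/parse from above on a contrived bit of Dot-like input with one extra level of nesting):

nested = "digraph g { a -> { b; { c; d; }; }; };"
begin
  parse(nested)
rescue RuntimeError => e
  puts e.message
  # => Parse error: 'di{i>{i;{i;i;};};};' doesn't match '...'  (regexp source elided)
end

A single fixed regexp can only describe a bounded nesting depth, so recognizing arbitrarily nested subgraph braces needs a context-free parser (or a small recursive helper) rather than one pattern match.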
def parse(string)
  # Build up a list of tokens in an array
  token_list = []
  # Make a one-to-one map between token type and the array
  token_string = ''

  tokenize(string) do |token|
    token_list << token
    if token == 'digraph'
      token_string += 'd'
    elsif token =~ %r{ [\{\};] }x
      token_string += token
    elsif token == '->'
      token_string += '>'
    else
      token_string += 'i'
    end
  end

Btw, note that string << 'i' is more efficient than string += 'i'.
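A rough way to see why (a sketch of my own, not from the thread): `<<` appends to the same String object, while `+=` builds a brand-new String on every iteration.

require 'benchmark'

Benchmark.bm(4) do |bm|
  bm.report('<<') { s = ''; 200_000.times { s << 'i' } }
  bm.report('+=') { s = ''; 200_000.times { s += 'i' } }
end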

And you can combine the two steps with a slightly modified regexp, since
you already know during scanning what you have got:

def tokenize(string)
  tk = []
  tl = ""

  string.scan( /(digraph) | ([a-z_][a-z_\d]*) | ([+-]?[.\d]+) | (->) | ([{};])/xi ) do |di, name, num, arr, noise|
    if di
      tk << di
      tl << 'd'
    elsif name
      tk << name
      tl << 'i'
    elsif num
      tk << num
      tl << 'i'
    elsif arr
      tk << arr
      tl << '>'
    elsif noise
      tk << noise
      tl << noise
    else
      # never here
    end
  end

  [tk, tl]
end

Now you can do

irb(main):144:0> tokens, pattern = tokenize( "digraph a_name {
irb(main):145:1" a -> b;
irb(main):146:1" c -> d -> e;
irb(main):147:1" f -> {a;b;c}
irb(main):148:1" d;
irb(main):149:1" };" )
=> [["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e", "f", "->", "{", "a", "b", "c", "}", "d", "}"], "di{i>ii>i>ii>{iii}i}"]
irb(main):150:0> tokens
=> ["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e", "f", "->", "{", "a", "b", "c", "}", "d", "}"]
irb(main):151:0> pattern
=> "di{i>ii>i>ii>{iii}i}"
irb(main):152:0>

Rest as before.

Cheers

robert

 
