questions of idiom


Collins

Hello List

I am relatively new to Ruby. I have set myself the problem of writing
a lexical analyzer in Ruby to learn some of its capabilities. I have
pasted the code for that class and for the calling test harness
below. I beg the list's indulgence in several ways:

1) Has this problem already been solved in a "gem"? I'd love to see
how a more sophisticated Rubyist solves it.
2) There is some object manipulation with which I'm still not comfortable.
In particular, the buffer manipulation code in the method analyze
makes me unhappy, and I'd be happy to receive instruction in a better
way to do it.
3) Every language has its idioms. I'm not at all sure that I'm using
the best or most "ruby-like" way of doing certain things. Again I
welcome suggestions.

Thanks in advance

Collins

##### code snippet 1 ######
class Rule
attr_reader :tokID, :re
def initialize(_tokID, _re)
@tokID = _tokID
@re = _re
end

def to_s
self.class.name + ": " + @tokID.to_s + "::= " + @re.to_s
end
end

class Match
attr_reader :rule, :lexeme
def initialize(_r, _s)
@rule = _r
@lexeme = _s
end

def to_s
self.class.name + ": " + @rule.to_s + "\nmatches: " + @lexeme.to_s
end
end

class Lexer
attr_reader :errString
# keep a collection of regular expressions and values to return as
# token types
# then match text to the longest substring yet seen
def initialize
@rules = Array.new
@buff = String.new
@aFile = nil
@errString = nil
end

def addToken (tokID, re)
if re.class.name == "String"
@rules << Rule.new(tokID, Regexp.new(re))
elsif re.class.name == "Regexp"
@rules << Rule.new(tokID, re)
else
print "unsupported type in addToken: ", re.class.name, "\n"
end
end

def findMatch
maxLexeme, maxMatch = String.new, nil
matchCount, rule2 = 0, nil
@rules.each { |rule|
# loop invariant:
# maxLexeme contains the longest matching prefix of @buff found so far,
# matchCount contains the number of rules that have matched maxLexeme,
# maxMatch contains the proposed return value
# rule2 contains a subsequent rule that matches maxLexeme
#
# if rule matches from beginning of @buff AND
# does not match all of @buff AND
# match is longer than previous longest match
# then update maxMatch and maxLexeme and matchCount and rule2
#
# but... we have to avoid matching and keep looking if we make it to the
# end of @buff with a match active (it could still collect more
# characters) OR if more than one match is still active. If the end of
# the @buff is also the end of the file then it's ok to match to the end
#
# TODO: think about prepending an anchor to the regexp to eliminate the
# false matches (those not to the beginning of the @buff)
#

md = rule.re.match(@buff)
if !md.nil? && md.pre_match.length == 0
if md[0].length == @buff.length && !@aFile.eof?
# @buff is potentially ambiguous and there is more file to parse
return nil
elsif md[0].length > maxLexeme.length
# either matching less than whole buffer or at eof AND
# match is longer than any prior match
matchCount, rule2 = 1, nil
maxLexeme, maxMatch = md[0], Match.new(rule,md[0])
elsif md[0].length == maxLexeme.length
# a subsequent match of equal length has been found
matchCount += 1
rule2 = rule
else
# short match... skip
end
else
# either rule did not match @buff OR
# rule did not match the start of @buff
end
}
if !maxMatch.nil? && matchCount == 1
#return an unambiguous match
return maxMatch
elsif !maxMatch.nil? && matchCount > 1
print "ambiguous: ", maxLexeme, " : ", maxMatch.rule.to_s, " :
",
rule2.to_s, "\n"
return nil
else
# no match was found
return nil
end
end

def analyze
aMatch = findMatch
if !aMatch.nil?
#remove matched text from buff
oldBuff = String.new(@buff)
newBuff = @buff[aMatch.lexeme.length,@buff.length-1]
if oldBuff != aMatch.lexeme + newBuff
puts oldBuff
puts "compare failure!"
puts aMatch.lexeme + newBuff
end
@buff = newBuff
end
return aMatch
end

def parseFile(_name)
@fileName = _name
@aFile = File.new(@fileName, "r")
@aFile.each {|line|
# add lines from file to @buff... after each addition yield as many
# tokens as possible
@buff += line
# consume all the tokens from @buff that can be found... when no more
# can be found analyze will return nil... so we'll get another line
aMatch = analyze
while !aMatch.nil?
# deliver one <token, lexeme pair> at a time to caller...
# by convention a nil tokID is one about which the caller does not
# care to hear...
yield aMatch.rule.tokID, aMatch.lexeme if !aMatch.rule.tokID.nil?
aMatch = analyze
end
}
# @buff contains the earliest unmatched text... if @buff is not empty when
# we finish with the file, this is an error
if !@buff.empty?
@errString = "error: unmatched text:\n" + @buff[0,[80,
@buff.length].min]
return false
else
@errString = "no errors detected\n"
return true
end
end
end

##### code snippet 2 ######

WhiteSpaceToken = 0
CommentToken = 1
QuotedStringToken = 2
WordToken = 3
require "lexer"
l = Lexer.new
l.addToken(nil, Regexp.new("\\s+", Regexp::MULTILINE))
l.addToken(nil, Regexp.new("#.*[\\n\\r]+"))
#l.addToken(QuotedStringToken, Regexp.new('["][^"]*["]', Regexp::MULTILINE))
l.addToken(QuotedStringToken,'["]((\\\")|[^\\\"])*"')
l.addToken(WordToken,Regexp.new("\\w+"))
foo = l.parseFile("testFile1") { |token, lexeme|
print token.to_s + ":" + lexeme.to_s + "\n"
}
if foo
print "pass!\n"
else
print "fail: " + l.errString + "\n"
end
 

Robert Klemme

Hello List

I am relatively new to Ruby. I have set myself the problem of writing
a lexical analyzer in Ruby to learn some of its capabilities. I have
pasted the code for that class and for the calling test harness
below. I beg the list's indulgence in several ways:

1) Has this problem already been solved in a "gem"? I'd love to see
how a more sophisticated Rubyist solves it.

There are certainly parser and lexer generators for Ruby. I cannot
remember one off the top of my head but you'll likely find one in RAA:

http://raa.ruby-lang.org/search.rhtml?search=lexer
http://raa.ruby-lang.org/search.rhtml?search=parser
2) There is some object manipulation with which I'm still not comfortable.
In particular, the buffer manipulation code in the method analyze
makes me unhappy, and I'd be happy to receive instruction in a better
way to do it.
3) Every language has its idioms. I'm not at all sure that I'm using
the best or most "ruby-like" way of doing certain things. Again I
welcome suggestions.

Thanks in advance

Collins

##### code snippet 1 ######
class Rule
attr_reader :tokID, :re
def initialize(_tokID, _re)
@tokID = _tokID
@re = _re
end

def to_s
self.class.name + ": " + @tokID.to_s + "::= " + @re.to_s
end
end

There are several things to note in the code above: we use tok_id instead
of tokID for methods and instance variables; only class names use
CamelCase. Also, it is highly uncommon to start identifiers with an
underscore.
An alternative way to create the String would be

def to_s
"#{self.class.name}: #{@tok_id}::= #{@re}"
end

String interpolation implicitly applies #to_s.

Finally you can define that class in one line:

Rule = Struct.new :tok_id, :re
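
For example, a minimal sketch of that (my illustration; Struct.new also
accepts a block if you want to keep the custom #to_s):

Rule = Struct.new(:tok_id, :re) do
  def to_s
    "#{self.class.name}: #{tok_id}::= #{re}"
  end
end

r = Rule.new(:word, /\w+/)
puts r       # => Rule: word::= (?-mix:\w+)
r.tok_id     # => :word (readers and constructor come for free)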
class Match
attr_reader :rule, :lexeme
def initialize(_r, _s)
@rule = _r
@lexeme = _s
end

def to_s
self.class.name + ": " + @rule.to_s + "\nmatches: " + @lexeme.to_s
end
end

class Lexer
attr_reader :errString
# keep a collection of regular expressions and values to return as
# token types
# then match text to the longest substring yet seen
def initialize
@rules = Array.new

If you are lazy you can just as well do

@rules = []
@buff = String.new

We usually simply do

@buff = ""

This also creates a new empty String and is easier to spot.
@aFile = nil
@errString = nil
end

def addToken (tokID, re)
if re.class.name == "String"
@rules << Rule.new(tokID, Regexp.new(re))
elsif re.class.name == "Regexp"
@rules << Rule.new(tokID, re)
else
print "unsupported type in addToken: ", re.class.name, "\n"
end
end

def add_token(tok_id, re)
@rules << Rule.new(tok_id,
case re
when String
Regexp.new(re)
when Regexp
re
else
raise ArgumentError, "Neither String nor regexp"
end)
end

"case" works by using the #=== operator which happens do be defined for
classes as kind_of? check. It's also safer to work with class instances
than with class names.
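
An irb-style illustration of that (standard Ruby behavior, not code from
the original post):

String === "abc"   # => true, same check as "abc".kind_of?(String)
Regexp === /abc/   # => true
String === /abc/   # => false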

In error cases we throw exceptions and let someone up the call chain
decide how to deal with the error. In your version you just get output on
the console, which might not be appropriate in all cases.
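
For example, a hypothetical caller could then decide for itself:

begin
  l.add_token(:num, 42)   # neither String nor Regexp
rescue ArgumentError => e
  warn "bad token rule: #{e.message}"
end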
def findMatch
maxLexeme, maxMatch = String.new, nil
matchCount, rule2 = 0, nil
@rules.each { |rule|
# loop invariant:
# maxLexeme contains the longest matching prefix of @buff found so far,
# matchCount contains the number of rules that have matched maxLexeme,
# maxMatch contains the proposed return value
# rule2 contains a subsequent rule that matches maxLexeme
#
# if rule matches from beginning of @buff AND
# does not match all of @buff AND
# match is longer than previous longest match
# then update maxMatch and maxLexeme and matchCount and rule2
#
# but... we have to avoid matching and keep looking if we make it to the
# end of @buff with a match active (it could still collect more
# characters) OR if more than one match is still active. If the end of
# the @buff is also the end of the file then it's ok to match to the end
#
# TODO: think about prepending an anchor to the regexp to eliminate the
# false matches (those not to the beginning of the @buff)
#

md = rule.re.match(@buff)
if !md.nil? && md.pre_match.length == 0
if md[0].length == @buff.length && !@aFile.eof?
# @buff is potentially ambiguous and there is more file to parse
return nil
elsif md[0].length > maxLexeme.length
# either matching less than whole buffer or at eof AND
# match is longer than any prior match
matchCount, rule2 = 1, nil
maxLexeme, maxMatch = md[0], Match.new(rule,md[0])
elsif md[0].length == maxLexeme.length
# a subsequent match of equal length has been found
matchCount += 1
rule2 = rule
else
# short match... skip
end
else
# either rule did not match @buff OR
# rule did not match the start of @buff
end
}
if !maxMatch.nil? && matchCount == 1
#return an unambiguous match
return maxMatch
elsif !maxMatch.nil? && matchCount > 1
print "ambiguous: ", maxLexeme, " : ", maxMatch.rule.to_s, " : ", rule2.to_s, "\n"
return nil
else
# no match was found
return nil
end
end

Somehow this method seems a bit lengthy. I did not look too deeply into
the details, but I'd probably pick a different strategy. First, I'd
anchor the expressions (as you suggested in your comment). Then I'd just do

matches = {}

@rules.each do |rule|
m = rule.re.match(@buff) and matches[rule] = m
end

matches

Now you know that

case matches.size
when 0
# nothing matches any more, take last match and strip
# buffer
when 1
# single match, remember as last match
else
# many matches, continue
end

If you place that in a loop that adds one character at a time to the
buffer and then invokes find_match, you can do the evaluation as
indicated above.
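
A rough sketch of that driving loop (the names and structure are mine,
assuming anchored rules and a find_match that returns the
{rule => MatchData} hash from above):

def next_token(io)
  last = nil
  loop do
    matches = find_match
    if matches.size == 1
      rule, md = matches.first
      last = [rule.tok_id, md[0]]   # remember the most recent viable match
    elsif matches.empty?
      break                         # no rule can grow any further
    end
    break if io.eof?
    @buff << io.readchar            # try to extend the match
  end
  @buff.slice!(0, last[1].length) if last   # strip matched text
  last
end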
def analyze
aMatch = findMatch
if !aMatch.nil?
#remove matched text from buff
oldBuff = String.new(@buff)
newBuff = @buff[aMatch.lexeme.length,@buff.length-1]
if oldBuff != aMatch.lexeme + newBuff
puts oldBuff
puts "compare failure!"
puts aMatch.lexeme + newBuff
end
@buff = newBuff
end
return aMatch
end

def parseFile(_name)
@fileName = _name
@aFile = File.new(@fileName, "r")
@aFile.each {|line|

Better use the block form of File.open() or use File.foreach. That way
you can be sure that the file handle is always properly closed. See

http://blog.rubybestpractices.com/posts/rklemme/001-Using_blocks_for_Robustness.html

I'd also choose a different reading strategy - one character at a time
or a fixed buffer width. But I would not read lines.
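
For example (a sketch; 512 is an arbitrary chunk size of my choosing):

File.foreach(@fileName) do |line|
  @buff << line
  # ... consume tokens from @buff ...
end   # the handle is closed automatically, even on exceptions

File.open(@fileName, "r") do |f|
  while chunk = f.read(512)   # fixed buffer width
    @buff << chunk
    # ... consume tokens from @buff ...
  end
end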
# add lines from file to @buff... after each addition yield as many
# tokens as possible
@buff += line

@buff << line

is more efficient.
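
The difference: += builds a brand-new String on every call, while <<
appends in place (standard Ruby semantics):

s = "a"
t = s
t += "b"   # t now references a new String; s is unchanged
s << "c"   # mutates s in place, no extra object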
# consume all the tokens from @buff that can be found... when no more
# can be found analyze will return nil... so we'll get another line
aMatch = analyze
while !aMatch.nil?
# deliver one <token, lexeme pair> at a time to caller...
# by convention a nil tokID is one about which the caller does not
# care to hear...
yield aMatch.rule.tokID, aMatch.lexeme if !aMatch.rule.tokID.nil?
aMatch = analyze
end
}
# @buff contains the earliest unmatched text... if @buff is not empty when
# we finish with the file, this is an error
if !@buff.empty?
@errString = "error: unmatched text:\n" + @buff[0,[80,
@buff.length].min]
return false
else
@errString = "no errors detected\n"
return true
end
end
end

##### code snippet 2 ######

WhiteSpaceToken = 0
CommentToken = 1
QuotedStringToken = 2
WordToken = 3

I would rather use Symbols as token keys (names). They are similarly
efficient but make the constant definitions superfluous.
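
For example (a hypothetical rewrite of the setup below):

l.add_token(nil, /\s+/m)
l.add_token(nil, /#.*[\n\r]+/)
l.add_token(:quoted_string, /"(?:\\.|[^"\\])*"/)
l.add_token(:word, /\w+/)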
require "lexer"
l = Lexer.new
l.addToken(nil, Regexp.new("\\s+", Regexp::MULTILINE))
l.addToken(nil, Regexp.new("#.*[\\n\\r]+"))
#l.addToken(QuotedStringToken, Regexp.new('["][^"]*["]', Regexp::MULTILINE))
l.addToken(QuotedStringToken,'["]((\\\")|[^\\\"])*"')
l.addToken(WordToken,Regexp.new("\\w+"))
foo = l.parseFile("testFile1") { |token, lexeme|
print token.to_s + ":" + lexeme.to_s + "\n"
}
if foo
print "pass!\n"
else
print "fail: " + l.errString + "\n"
end

There are of course completely different strategies to tackle this.
Lexers are usually built as a DFA or NFA (which is what every Regexp
implementation uses internally). You would then feed one character at a
time to the FA and derive token types from its states.
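
To make that concrete, here is a tiny hand-rolled DFA sketch (my
illustration, not code from this thread) that classifies words and
whitespace one character at a time:

DFA = {
  start: ->(c) { c =~ /\s/ ? :space : (c =~ /\w/ ? :word : :error) },
  space: ->(c) { c =~ /\s/ ? :space : nil },   # nil means "token complete"
  word:  ->(c) { c =~ /\w/ ? :word  : nil },
  error: ->(c) { nil }   # unknown chars become one-char :error tokens
}

def tokenize(text)
  tokens, state, lexeme = [], :start, ""
  text.each_char do |c|
    nxt = DFA[state].call(c)
    if nxt.nil?                     # the current token ended just before c
      tokens << [state, lexeme]
      state, lexeme = DFA[:start].call(c), c.dup
    else
      state = nxt
      lexeme << c
    end
  end
  tokens << [state, lexeme] unless lexeme.empty?
  tokens
end

p tokenize("foo bar")   # => [[:word, "foo"], [:space, " "], [:word, "bar"]]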

Also, another option would be to lump all tokens into a single regular
expression with group matches for every token type and analyse the
matching groups, e.g.

re = %r{
(\s+) # white space
| ("(?:\\.|[^"\\])*") # quoted string
}x

File.read(file_name).scan(re) do |m|
case
when $1
printf "whitespace %p\n", $1
when $2
printf "quote %p\n", $2
else
raise "Cannot be, scanner error"
end
end

Kind regards

robert
 

Josh Cheek

Finally, there is prior art that may interest you. Ruby parser/lexers
include:

racc
treetop
ripper
grammar

If you're interested in Treetop, here is a video of it being presented at a
conference (it has a nice live-coding section). I tried playing with it a
while back, and with a lot of effort was able to get past the basics. But I
think the wall I ultimately ran into had to do with understanding what to do
with the result once it was parsed. Maybe I'll try to get into a compiler
class next semester. Also, the docs could be a lot better.
 

Brian Candler

Robert said:
def add_token(tok_id, re)
@rules << Rule.new(tok_id,
case re
when String
Regexp.new(re)
when Regexp
re
else
raise ArgumentError, "Neither String nor regexp"
end)
end

or even just this:

def add_token(tok_id, re)
@rules << Rule.new(tok_id, Regexp.new(re))
end

That's not exactly the same, since Regexp.new(a_regexp) does actually
create a new Regexp object, but AFAICT it's a functionally identical
regexp.
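
An irb-style check of that (standard Ruby behavior):

r  = /\w+/
r2 = Regexp.new(r)
r2.equal?(r)   # => false (a distinct object)
r2 == r        # => true  (same source and options)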
 
