Breaking Ruby code into tokens

Hal Fulton · Oct 4, 2003

I've been looking at lex.c and parse.y and parse.c, but it's all rather
over my head.

How might one simply break Ruby code into tokens?

Maybe token isn't the proper term, but I think it is. I've written
very few real parsers in my life.

For example, obviously all keywords and identifiers and punctuation
(such as <<) would be treated as single entities. Strings and
regular expressions would also be treated as such.

I know that Ruby grammar is nontrivial -- for example, by looking at
the code I just now realized that "class <<" is treated as a special
case so that it won't look like a here-doc. Never thought of that
before.

But a full-fledged parser is overkill, too, isn't it? Surely this could
be done in 100 lines of Ruby or so?

Enlighten me...

Hal

daz · Oct 5, 2003

Hal Fulton said:
I've been looking at lex.c and parse.y and parse.c, ...

Pending a correction, lex.c is an unused remnant.
parse.c is ignorable (generated by Yacc from parse.y).
The real ruby lexer is in parse.y (function yylex).

Hal Fulton said:
How might one simply break Ruby code into tokens?

While writing IRB, Keiju ISHITSUKA seems to have taken
the trouble to expose his lexer to other callers.
Thank you.

ruby-lex is a Ruby emulation of the interpreter's lexer.
(It may have slight differences.)
As part of IRB, it's in the standard distribution.

I haven't seen any examples, so here is one. This script tokenizes
itself, but you can point it at a script file instead.

#------------------------------------
require 'irb/ruby-lex'

include RubyToken

#File.open('testfile.rb') do |infile| # see: lex.set_input

tree = []
ikeys = [:name, :op, :value, :node]

lex = RubyLex.new
DATA.rewind
lex.set_input(DATA) # (DATA) or (infile)

line = lex.get_readed # read (past tense)

while tk = lex.token
  tkc = tk.class.to_s.sub(/\ARubyToken::/, '')

  tkih = { :tk => tkc,
           :line => tk.line_no,
           :seek => tk.seek,
           :char_no => tk.char_no }

  # some tokens have extra attributes.
  ikeys.each do |tkk|
    tkih[tkk.to_sym] = tk.respond_to?(tkk) && tk.send(tkk)
  end

  tree << tkih

  if tkc == 'TkNL'
    # puts line unless line =~ /\A\s*\Z/ # line sep
    line = lex.get_readed # next line
    # Note: read line left here otherwise
    # position of NL is mis-reported [BUG?].
  end
end

tree.each do |tkh|
  printf("line %-3d @%3d: %-12s", tkh[:line], tkh[:char_no], tkh[:tk])
  printf(" [%s]", tkh[:name]) if tkh[:name]

  tkh.each do |k, v|
    next unless (ikeys - [:name]).include?(k)
    printf(" %s(%s)", k, v) if v
  end
  puts
  puts if tkh[:tk] == 'TkNL'
end

#end # File.open
__END__
#------------------------------------

There may be other methods of interest in:

lib\ruby\1.8\irb\slex.rb
lib\ruby\1.8\irb\ruby-lex.rb
lib\ruby\1.8\irb\ruby-token.rb

daz

Hal Fulton · Oct 5, 2003

daz said:
Pending a correction, lex.c is an unused remnant.
parse.c is ignorable (generated by Yacc from parse.y).
The real ruby lexer is in parse.y (function yylex).

Didn't know lex.c was a leftover.

I know parse.c is generated from parse.y, but I can
read C and can't read yacc.

While writing IRB, Keiju ISHITSUKA seems to have taken
the trouble to expose his lexer to other callers.
Thank you.

irb/ruby-lex is what I've settled on. It works nicely.
(Mauricio or batsman on IRC also pointed me that way.)

Thanks,
Hal
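
For readers on Ruby 1.9 or later, the "class <<" versus here-doc case raised at the top of the thread can also be checked with Ripper, the lexer/parser library bundled with those versions. This is only a minimal sketch and not the irb/ruby-lex approach used above: token_pairs is a hypothetical helper name, and the two sample strings are made up for illustration.

```ruby
require 'ripper'

# Hypothetical helper (not part of Ripper): reduce each lexer entry to an
# [event, text] pair and drop plain whitespace tokens.
def token_pairs(source)
  Ripper.lex(source)
        .map { |tok| [tok[1], tok[2]] }
        .reject { |event, _text| event == :on_sp }
end

# "<<" directly after the keyword "class" opens a singleton class.
singleton = token_pairs("class << self; end")

# "<<EOS" in an ordinary expression starts a here-doc.
heredoc = token_pairs("x = <<EOS\nhello\nEOS\n")

singleton.each { |event, text| printf("%-18s %p\n", event, text) }
puts
heredoc.each { |event, text| printf("%-18s %p\n", event, text) }
```

In the first listing "<<" should show up as an ordinary operator token, while in the second "<<EOS" is reported as a single here-doc opener, which is the distinction the special case in the lexer exists to make.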
