announcing RubyLexer 0.6.0

vikkous

At this time, I am pleased to announce the release of RubyLexer 0.6.0,
a standalone lexer of ruby in ruby. RubyLexer attempts to completely
and correctly tokenize all valid ruby 1.8 source code, and it mostly
succeeds. In time, RubyLexer will be able to lex all ruby code. For
now, some newer features are unsupported and there are some extremely
obscure bugs involving strings, but all real world ruby code should be
supported. It is my hope to provide a high-quality lexer for all those
language tools which require one.

RubyLexer is hosted on RubyForge
(http://rubyforge.org/projects/rubylexer/).
Here's where to get the tarball:
http://rubyforge.org/frs/download.php/4191/rubylexer-0.6.0.tar.bz2
 
Trans

Hi,

could you describe Ruby lexer a bit more. I know very little about
lexers, so excuse if I ask dumb questions, but... What's the output
look like? How does it compare to other projects like ParseTree? Do you
have any plans for its use?

Thanks,
T.
 
Florian Groß

vikkous said:
At this time, I am pleased to announce the release of RubyLexer 0.6.0,
a standalone lexer of ruby in ruby. RubyLexer attempts to completely
and correctly tokenize all valid ruby 1.8 source code, and it mostly
succeeds.

How extendable is this? Would you be able to add new rules to it at
run-time? If it is like that then it could be used for writing Ruby
source code filters which is something that is useful for exploring new
syntax.

I can also contribute a few pieces of code that I think are hard to
lex properly, if you are interested.
 
vikkous

A lexer, or tokenizer (they mean the same thing), divides an input
source language into words. It also removes comments and finds the
boundaries of strings. Once this is done, it's much easier to correctly
process the language in a pre-processor or parser. Here's an example.
Given this ruby code:

8+(9 *5)

a correct lexing is something like:

["8","+","(","*","5",")"]

(For lexing purposes, punctuation and operators count as strings as
well.)

The output of RubyLexer is actually more complicated than that... for
one thing, there are tokens for whitespace as well. For another, the
individual tokens are not Strings, but Tokens (or subclasses of it, to
be precise), a class defined in RubyLexer. Tokens do respond to to_s in
the expected way, however. (Initially, I did want to have RubyLexer
just return Strings, but it turned out I needed to distinguish
different token types, and the best way to do that is with the type
system.)
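
Here's roughly what driving it looks like -- an untested sketch; I'm
assuming the RubyLexer.new(filename, io) constructor and the get1token
method here, so the exact names in 0.6.0 may differ slightly:

require 'rubylexer'
require 'stringio'

lexer = RubyLexer.new("(eval)", StringIO.new("8+(9 *5)"))
loop do
  token = lexer.get1token
  break if token.class.name =~ /EoiToken/   # end-of-input token
  puts "#{token.class}  #{token.to_s.inspect}"
end

Whitespace tokens and the like show up in that listing too, which is why
the real output is noisier than the idealized list above.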

ParseTree is a parser, not a lexer. Parsing is the next step in a
compiler pipeline; it determines the order in which to evaluate the
operations in an expression and solves the difficult problems of
precedence and associativity. (Another way to think of parsers is as
the bit that figures out where the implicit parentheses are inserted
into the source code.) I think that the tool corresponding to RubyLexer
is Ripper, but I don't really know, so don't blame me if I'm wrong.
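
A small illustration of what 'inserting the implicit parentheses' means,
in plain ruby (the lexer sees the same flat token stream either way; the
parser decides the grouping):

8 + 9 * 5      #=> 53, parsed as 8.+(9.*(5)) because * binds tighter
(8 + 9) * 5    #=> 85, explicit parentheses override precedence
8.+(9.*(5))    #=> 53, the implicit parentheses spelled out as method calls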

I have lots of plans, of course, but being only one little programmer
with lots of big ideas, who knows if I'll ever get to them...
 
vikkous

How extendable is this? Would you be able to add new rules to it
at run-time?

Ummm... if you're really lucky, maybe. I didn't really have
extensibility in mind. It might be possible to add it, without a lot
of trouble, depending on what you want to extend. So, what do you want
to extend?
If it is like that then it could be used for writing Ruby
source code filters which is something that is useful for exploring
new syntax.

One of the applications I had in mind was to create a lexer family for
ruby-like languages, but that has sort of fallen by the wayside right
now. I still like the idea, but other priorities press at the moment.
I can also contribute a few pieces of code that I think are hard
to lex properly, if you are interested.

Oh! That would be lovely. Weird syntax, obscure syntax, new syntax,
twisted, devious, mutant syntax, I want it all for my menagerie.
 
Hal Fulton

vikkous said:
Oh! That would be lovely. Weird syntax, obscure syntax, new syntax,
twisted, devious, mutant syntax, I want it all for my menagerie.

Ha... I'll see if I can dig up anything.

In the meantime, one of my favorites is an expression containing a
string that contains an interpolated expression that contains a
string containing another interpolated expression:

x = "Hi, my name is #{"Slim #{rand(4)>2?"Whitman":"Shady"}"}."


Hal
 
gabriele renzi

vikkous wrote:
At this time, I am pleased to announce the release of RubyLexer 0.6.0,
a standalone lexer of ruby in ruby. RubyLexer attempts to completely
and correctly tokenize all valid ruby 1.8 source code, and it mostly
succeeds. In time, RubyLexer will be able to lex all ruby code. For
now, some newer features are unsupported and there are some extremely
obscure bugs involving strings, but all real world ruby code should be
supported. It is my hope to provide a high-quality lexer for all those
language tools which require one.

RubyLexer is hosted on RubyForge
(http://rubyforge.org/projects/rubylexer/).
Here's where to get the tarball:
http://rubyforge.org/frs/download.php/4191/rubylexer-0.6.0.tar.bz2

First, let me say I think this is cool :)
Anyway, I wonder: isn't something like this included with ruby (irb's
lexer)?
Care to explain the differences a little?
 
Florian Groß

vikkous said:
Ummm... if you're really lucky, maybe. I didn't really have
extensibility in mind. It might be possible to add it, without a lot
of trouble, depending on what you want to extend. So, what do you want
to extend?

One simple example would be adding a ".=" assign-result-of-method-call
operator as in "foo = 'bar'; foo .= reverse"
Oh! That would be lovely. Weird syntax, obscure syntax, new syntax,
twisted, devious, mutant syntax, I want it all for my menagerie.

See attachment.

« pre.rb »
 
vikkous

first let me say I think this is cool :)
Anyway, I wonder: isn't something like this included with ruby (irb's
lexer) ?
Care to explain the differences a little?

Irb's lexer is not as complete. I can't think of any examples, but when
developing this, I played around with irb quite a bit, trying different
syntaxes. Irb would do pretty well most of the time, but every so
often, I'd come up with something that had to be wrapped in eval %() in
order to work in irb...
 
vikkous

"Hi, my name is #{"Slim #{rand(4)>2?"Whitman":"Shady"} "}."

Yes, this is the type of thing I'm thinking of! Stretch the language!
Bend it to the breaking point! <Sound of whip cracking>. But you're not
being deviant enough; you didn't break my lexer yet (tho you can never
be too sure with these string interpolations).

Here's how tricky you have to be to fool it:

p "#{<<kekerz}#{"foob"
zimpler
kekerz
}"

Here document header and body in different interpolations... tricky.
 
Peter Suk

Irb's lexer is not as complete. I can't think of any examples, but when
developing this, I played around with irb quite a bit, trying different
syntaxes. Irb would do pretty well most of the time, but every so
often, I'd come up with something that had to be wrapped in eval %() in
order to work in irb...

Examples?
 
vikkous

Florian said:
One simple example would be adding a ".=" assign-result-of-method-call
operator as in "foo = 'bar'; foo .= reverse"

At first, I thought, "This guy is dreaming; my code is just too rigid
to allow extensions of that kind very easily." But of course, it
wouldn't be too hard for me to special-case this one operator for
you if you wanted to... it'd just be a quick hack in RubyLexer#dot...
in fact, it could be done in a subclass:
[warning: untested code!]

class FlorianRubyLexer < RubyLexer
  def dot(ch)
    #this is the routine in RubyLexer that handles tokens beginning with '.'
    if readahead(2)=='.='
      KeywordToken.new(@file.read(2),@file.pos-2)
    else
      super
    end
  end
end
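
Still untested, and assuming the same RubyLexer.new(filename, io) /
get1token interface as in the earlier sketch, you'd then drive the
subclass like the plain lexer:

require 'stringio'

lexer = FlorianRubyLexer.new("(eval)", StringIO.new("foo = 'bar'; foo .= reverse"))
loop do
  tok = lexer.get1token
  break if tok.class.name =~ /EoiToken/   # end-of-input token
  puts "#{tok.class}  #{tok.to_s.inspect}"
end
# with the dot override, '.=' should show up as a single KeywordToken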

Not too bad for extensibility, eh? I think things look quite hopeful
for your idea, actually....
you do have to know RubyLexer internals to do this kind of thing, but
that's true for any library. And you probably want to add operators
that create no new ambiguities in the language. This one doesn't create
ambiguity, a sign that you've been thinking about this already. Tell me
more of the kind of thing you want, and maybe I'll write more of your
lexer for you.
See attachment.
« pre.rb »

Now that's deviant! Whitespace as a fancy string delimiter... I don't
even know if that's what breaks RubyLexer, but that's sick, man, really
sick.

Ps: what does the code do?
 
vikkous

Examples?

I should have written them down, but I didn't. Next time I come across
one, I'll let you know. Some no doubt got into testdata/p.rb (in
rubylexer).
 
gabriele renzi

vikkous wrote:
Irb's lexer is not as complete. I can't think of any examples, but when
developing this, I played around with irb quite a bit, trying different
syntaxes. Irb would do pretty well most of the time, but every so
often, I'd come up with something that had to be wrapped in eval %() in
order to work in irb...

This is what I expected; I just think you should make it clear to casual
users :)
 
Florian Groß

vikkous said:
One simple example would be adding a ".=" assign-result-of-method-call
operator as in "foo = 'bar'; foo .= reverse"

At first, I thought, "This guy is dreaming; my code is just too rigid
to allow extensions of that kind very easily." But of course, it
wouldn't be too hard for me to special-case this one operator for
you if you wanted to... it'd just be a quick hack in RubyLexer#dot...
in fact, it could be done in a subclass:
[warning: untested code!]

class FlorianRubyLexer < RubyLexer
  def dot(ch)
    #this is the routine in RubyLexer that handles tokens beginning with '.'
    if readahead(2)=='.='
      KeywordToken.new(@file.read(2),@file.pos-2)
    else
      super
    end
  end
end

Which is exactly what I thought would be a good way of extending. This
looks good.

Another thing that I would be able to make good use of is getting the
next expression, whatever it might be.

Let's say I have this code:

z = if (x + y) * 2 > 2 then
  code here
end

It would then be very nice if I could lex until I see the 'if' then say
'give me an atomic expression' which would parse until the 'then' and
then say 'give me an atomic expression' again which would parse until
the 'end'. Basically I don't want to match paired things (parentheses,
do .. end, class definitions etc.) at the transformation level.

Yup, that sample does not introduce any new syntax -- I would like to
transform it to this:

z = if ((x + y) * 2 > 2).true? then
  code here
end

Which is why I would need to find a sub-expression.

Also note that just grabbing everything until the next 'then' would not
be good enough:

# Nonsense code, but still valid
if x > if x < 5 then 3 else 2 end then
  puts "Good!"
end

If it weren't for that point then IRB's lexer would be a more or less
nifty match already.

Does this sound like something that can be done without too much trouble?

For doing code transformations it is of course also important that you
can turn the stream of tokens back into a String easily. I did this with
IRB's lexer by using the .line_no and .pos methods of tokens, but that
was not too good a match, actually.
Now that's deviant! Whitespace as a fancy string delimiter... I don't
even know if that's what breaks RubyLexer, but that's sick, man, really
sick.

Oh, that is still relatively simple. There's worse stuff happening under
the surface.
Ps: what does the code do?

If you invoke it as ruby -rpre file.rb it will pre-process file.rb
before letting Ruby handle it. It parses simple directives that look
like this:

#!if rand > 0.5 then
}{}{ # Cause a Syntax Error
#!else
puts "Hello World"
#!end

That file would produce a Syntax Error at parse-time half of the time
and output Hello World in the other cases.
"Hello"
1+5
Time.now
#!gsub!(/^>/, "puts")

And that would make '>' at the beginning of a line mean 'output this: '.

It's basically something like the C preprocessor, but in a more Rubyish
manner, written in an obscure style. I guess it is pretty useless after all.
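
For the curious, here's a much-simplified sketch of the same general
idea -- a preprocessor that strips dead #!if/#!else/#!end branches and
then evals what's left. This is only my own guess at the mechanics, not
Florian's pre.rb, and it ignores the #!gsub! directive entirely:

# simple_pre.rb -- run as:  ruby simple_pre.rb some_file.rb
src  = File.read(ARGV[0])
out  = []
keep = [true]                      # stack: are we keeping lines right now?

src.each_line do |line|
  case line
  when /\A#!if\s+(.*?)(\s+then)?\s*\z/
    keep.push(keep.last && !!eval($1))   # a branch inside a dropped branch stays dropped
  when /\A#!else\s*\z/
    keep[-1] = keep[-2] && !keep[-1]
  when /\A#!end\s*\z/
    keep.pop
  else
    out << line if keep.last
  end
end

eval(out.join, TOPLEVEL_BINDING, ARGV[0])

Run against the #!if rand > 0.5 example above, it either evals the }{}{
line (syntax error) or the puts line, which matches the described behavior.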
 
vikkous

Florian Groß wrote:
Which is exactly what I thought would be a good way of
extending. This looks good.

Everything may not be as simple as this one case was. The fact that the
first example you gave turned out to be pretty easy is encouraging, but
I think we're likely to run into something really nasty before you are
happy.
It would then be very nice if I could lex until I see the 'if' then
say 'give me an atomic expression' which would parse until
the 'then' and then say 'give me an atomic expression' again
which would parse until the 'end'. Basically I don't want to
match paired things (parentheses, do .. end, class definitions
etc.) at the transformation level.

In general, 'get the next expression' is a problem that requires a
parser, not a lexer. Have you looked at ParseTree? Of course you have.

In this case, however, you are in luck. Delimited expressions, ones that
start and end with ( and ), or begin and end, or whatever, are already
discovered by my lexer. (During the development of RubyLexer, I
discovered that it had to be half-a-parser as well, in order to
get all the information that's needed to lex correctly.) The
information you want is already being gathered by RubyLexer; it's just
not available in a public interface. We should negotiate such an
interface, since you seem to need it. What you propose, 'get the next
expression', is not one I want to do. RubyLexer does not deal in
abstractions larger than tokens... at least, not on a public level. I
am, however, willing to emit 'advisory' tokens at certain points in the
token stream (several such types of tokens are being emitted already),
which should allow you to do what we want, if we design it carefully.

On the other hand.... the reason I chose not to emit advisory tokens
for this particular case is that the complementary tool to RubyLexer is
intended to be Reg, which can find nested pairs of braces and the like
pretty easily. Have you looked at Reg at all? I realize that I only
released it yesterday, and so far it's only half-working because
critical features are as yet unimplemented, but I think it might be
just the thing for the types of preprocessors you have in mind.

Reg might not be able to easily tell 'if' the postfix operator from
'if' the value in current RubyLexer output. Since one requires an end
and the other doesn't, that can be troublesome to deal with. 'do' is
also a pain, now that I think of it. All these cases are handled
correctly in RubyLexer; we just have to find an appropriate
(token-based, not expression-based) interface.
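
Just to make 'matching paired things' at the token level concrete,
here's a toy pair-matcher over a flat list of token strings. This is
neither RubyLexer's nor Reg's interface, and it naively treats every
'if' as a block opener -- which is exactly the modifier-vs-keyword
distinction RubyLexer has to sort out internally:

OPENERS = %w[if unless while until case begin def class module do]

def matching_end(tokens, i)          # assumes tokens[i] is an opener
  depth = 0
  i.upto(tokens.size - 1) do |j|
    depth += 1 if OPENERS.include?(tokens[j])
    depth -= 1 if tokens[j] == "end"
    return j if depth.zero?
  end
  nil
end

toks = ["if","x",">","if","x","<","5","then","3","else","2","end","then","puts",'"Good!"',"end"]
p matching_end(toks, 0)   #=> 15, the final 'end'
p matching_end(toks, 3)   #=> 11, the inner 'end'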
Also note that just grabbing everything until the next 'then' would
not be good enough:

# Nonsense code, but still valid
if x > if x < 5 then 3 else 2 end then
puts "Good!"
end

Don't worry about this type of thing. I have these problems well under
control, one way or another.
Does this sound like something that can be done without
too much trouble?
Definitely!

For doing code transformations it is of course also important that
you can turn back the stream of tokens into a String easily. I did
this with IRB's lexer by using the .line_no and .pos methods of
tokens, but that was not too good a match, actually.

So what would be a good match? I don't see why this should be a
problem. My implementation of Token implements to_s, which returns the
ruby code corresponding to the token; usually, this is exactly the
same as the code that created the token originally. There's also an
offset method, which returns the position of the token in the input
stream, relative to the very beginning. Tokens don't have a #line_no,
but you can get the same information from FileAndLineTokens.

Turning the token stream back into a big string (or file) is essentially
what one of my test programs (tokentest) does. The resulting ruby files
are legal and parse in exactly the same way. I haven't yet shown that
they are really exactly equivalent (but there's not much room for
variation); that will be the next RubyLexer release.
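
The core of that round trip is tiny. Again an untested sketch assuming
the RubyLexer.new(filename, io) / get1token interface; tokentest in the
rubylexer distribution is the real thing:

require 'rubylexer'

def relex(path)
  lexer  = RubyLexer.new(path, File.open(path))
  result = ""
  loop do
    tok = lexer.get1token
    break if tok.class.name =~ /EoiToken/
    result << tok.to_s          # each token knows the source text it stands for
  end
  result
end

# relex("foo.rb") won't always equal File.read("foo.rb") byte for byte,
# but it should parse the same way.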
If it weren't for that point then IRB's lexer would be a more or
less nifty match already.
I did this with IRB's lexer by using the .line_no and .pos
methods of tokens, but that was not too good a match, actually.

Wait... so you wrote irb's lexer? One of my wishlist items is to
integrate RubyLexer with irb among others.... how hard do you think
this will be?
Oh, that is still relatively simple. There's worse stuff happening
under the surface.

Well, it was unexpected for me, much to my embarrassment; I thought I
was an expert at this. I must say many elements of this got me very
confused at first, and obviously I never put all the pieces together.
Congratulations.

Ps: I haven't figured out why this breaks RubyLexer yet, but I will.

Pps: putting tricky stuff in eval strings and the like won't break the
lexer (yet). To the lexer, it's just a string.
It's basically something like the C preprocessor, but in a more
Rubyish manner written in obscure style. I guess it is pretty
useless after all.

Not at all. Now that I know what it does, maybe I'll find a use for it,
someday.
 
Florian Groß

vikkous said:
Everything may not be as simple as this one case was. The fact that the
first example you gave turned out to be pretty easy is encouraging, but
I think we're likely to run into something really nasty before you are
happy.

Hm, that ought not to be too much of a problem. I'm okay with having a
look at some of the internals for that kind of thing.
In general, 'get the next expression' is a problem that requires a
parser, not a lexer. Have you looked at ParseTree? Of course you have.

In this case however, you are in luck. Delimited expressions, that
start and end with ( and ), or begin and end, or whatever, are already
discovered by my lexer. (During the development of RubyLexer, I
discovered that it had to be half-a-parser as well, in order to
correctly get all the information that's needed to lex correctly.) The
information you want is already being gathered by RubyLexer, it's just
not available in a public interface. We should negotiate such an
interface since you seem to need it. What you propose, 'get the next
expression', is not one I want to do. RubyLexer does not deal in
abstractions larger than tokens... at least, not on a public level. I
am, however, willing to emit 'advisory' tokens at certain points in the
token stream, (several such types of tokens are being emitted already)
which should allow you to do what we want, if we design it carefully.

Hm, I am not sure if that is enough for this case. The condition part of
an if or something else will after all not always be surrounded by ( and
) or begin and end or something similar.

Advisory tokens (which would tell me that I am now entering the
condition of if and now leaving it and now entering the action part of
it and so on) might do this. However, you are right in that this is not
usually the task of a lexer. In the past I have frequently had trouble
with the distinction of lexing and parsing in real language parsing --
most languages require you to keep some context for actually tokenizing
them. Ruby, for example, requires that your lexer knows about all kinds
of quoted Strings and where they end and interpolated expressions inside
them. I'm not sure where best to draw the line, so it's probably
better to let you decide.
On the other hand.... the reason I chose not to emit advisory tokens
for this particular case is that the complementary tool to RubyLexer is
intended to be Reg, which can find nested pairs of braces and the like
pretty easily. Have you looked at Reg at all? I realize that I only
released it yesterday, and as of yet it's only half-working because
critical features are as yet unimplemented, but I think it might be
just the thing for the types of preprocessors you have in mind.

Heh, I didn't realize that you were also the author of that library so I
did not draw the connection. I have, however, marked those two threads
as something I will have to examine. (They are now colored red.)

I'm watching Reg with growing interest -- I'm not sure if I have already
told this to you (I remember telling the author of "BNF-like grammar
specified DIRECTLY in Ruby"), but I have also done something vaguely
similar -- I have done an object-oriented way of constructing and
combining Regular Expressions. What you have done is something better.

I'm especially interested in how the LALR parser, Reg and RubyLexer
might all work together. Any way of getting some sample code? I'm aware
of the fact that this is all subject to change as long as you have not
implemented all the necessary features like look-ahead, but getting a
quick overview would still be nice.
Reg might not be able to easily tell 'if' the postfix operator from
'if' the value in current RubyLexer output. Since one requires an end
and the other doesn't, that can be troublesome to deal with. 'do' is
also a pain, now that I think of it. All these cases are handled
correctly in RubyLexer, we just have to find an appropriate
(token-based, not expression-based) interface.

I would be pretty much okay with the advisory tokens idea -- it sounds
like meta-tokens that tell me about the context.
So what would be a good match? I don't see why this should be a
problem. My implementation of Token implements to_s, which returns the
ruby code corresponding to the token; usually, this is exactly the
same as the code that created the token originally. There's also an
offset method, which returns the position of the token in the input
stream, relative to the very beginning. Tokens don't have a #line_no,
but you can get the same information from FileAndLineTokens.

This does sound good. Having an offset ought to actually be better than
separate character and line numbers as well.
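
For instance, getting a line and column back out of an offset is a
couple of lines of generic Ruby, nothing RubyLexer-specific:

def line_and_column(src, offset)
  before = src[0, offset]
  line   = before.count("\n") + 1
  col    = offset - (before.rindex("\n") ? before.rindex("\n") + 1 : 0)
  [line, col]   # 1-based line, 0-based column
end

p line_and_column("foo\nbar baz\n", 8)   #=> [2, 4]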
Wait,,,, so you wrote irb's lexer? One of my wishlist items is to
integrate RubyLexer with irb among others.... how hard do you think
this will be?

Nope, not really. I've just used it out of IRB. Integrating it ought to
be possible, but I'm not sure why that would be necessary.
Well, it was unexpected for me, much to my embarrassment; I thought I
was an expert at this. I must say many elements of this got me very
confused at first, and obviously I never put all the pieces together.
Congratulations.

Ps: I haven't figured out why this breaks RubyLexer yet, but I will.

Good luck. :)
Pps: putting tricky stuff in eval strings and the like won't break the
lexer (yet). To the lexer, it's just a string.

Yup, same for IRB.
 
Peter Suk

I'm especially interested in how the LALR parser, Reg and RubyLexer
might all work together. Any way of getting some sample code? I'm
aware of the fact that this is all subject to change as long as you
have not implemented all the necessary features like look-ahead, but
getting a quick overview would still be nice.

I am currently constructing an LALR parser for Ruby using RubyLexer for
the Alumina-VM project. I suspect that RubyLexer is going to make this
much cleaner.

--Peter
 
