Florian Groß wrote:
Which is exactly what I thought would be a good way of
extending. This looks good.
Not everything will be as simple as this one case was. The fact that the
first example you gave turned out to be pretty easy is encouraging, but
I think we're likely to run into something really nasty before you are
happy.
It would then be very nice if I could lex until I see the 'if' then
say 'give me an atomic expression' which would parse until
the 'then' and then say 'give me an atomic expression' again
which would parse until the 'end'. Basically I don't want to
match paired things (parentheses, do .. end, class definitions
etc.) at the transformation level.
In general, 'get the next expression' is a problem that requires a
parser, not a lexer. Have you looked at ParseTree? Of course you have.
In this case however, you are in luck. Delimited expressions, that
start and end with ( and ), or begin and end, or whatever, are already
discovered by my lexer. (During the development of RubyLexer, I
discovered that it had to be half-a-parser as well, in order to
correctly get all the information that's needed to lex correctly.) The
information you want is already being gathered by RubyLexer, it's just
not available in a public interface. We should negotiate such an
interface, since you seem to need it. What you propose, 'get the next
expression', is not something I want to implement. RubyLexer does not
deal in abstractions larger than tokens... at least, not on a public level. I
am, however, willing to emit 'advisory' tokens at certain points in the
token stream, (several such types of tokens are being emitted already)
which should allow you to do what we want, if we design it carefully.
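To make that concrete, here is a toy sketch of what consuming such advisory tokens might look like. The :expr_begin/:expr_end names and the Token struct are invented for illustration; they are not RubyLexer's actual token types, which would still have to be negotiated:

```ruby
# Toy sketch: collect the text of one delimited expression from a
# token stream, using hypothetical advisory tokens to mark its
# boundaries and a depth counter to handle nesting.

Token = Struct.new(:type, :text)

def collect_expression(tokens)
  depth = 0
  out = []
  tokens.each do |tok|
    case tok.type
    when :expr_begin
      depth += 1
      next if depth == 1   # don't emit the advisory marker itself
    when :expr_end
      depth -= 1
      break if depth.zero? # matching close found; expression complete
    end
    out << tok.text if depth > 0
  end
  out.join
end

stream = [
  Token.new(:word, "x = "),
  Token.new(:expr_begin, ""),
  Token.new(:word, "(a + b)"),
  Token.new(:expr_end, ""),
]
puts collect_expression(stream)  # => "(a + b)"
```

The point of the advisory tokens is that the consumer never has to re-discover where expressions start and stop; it only counts markers the lexer already emitted.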
On the other hand.... the reason I chose not to emit advisory tokens
for this particular case is that the complementary tool to RubyLexer is
intended to be Reg, which can find nested pairs of braces and the like
pretty easily. Have you looked at Reg at all? I realize that I only
released it yesterday, and as of yet it's only half-working because
critical features are as yet unimplemented, but I think it might be
just the thing for the types of preprocessors you have in mind.
Reg might not be able to easily tell 'if' the postfix operator apart
from 'if' the expression form in current RubyLexer output. Since one
requires an 'end' and the other doesn't, that can be troublesome. 'do' is
also a pain, now that I think of it. All these cases are handled
correctly in RubyLexer, we just have to find an appropriate
(token-based, not expression-based) interface.
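For what it's worth, once the lexer has done the hard part and disambiguated postfix 'if' from the statement form, pairing keywords with their 'end's is just depth counting. A toy sketch (the opener list and helper name are made up for illustration; Reg's actual matching is richer):

```ruby
# Toy depth counter over a pre-lexed word stream. Assumes every
# occurrence of these keywords is the statement form that takes an
# 'end' -- i.e. the lexer has already filtered out postfix 'if'/'while'.
OPENERS = %w[if unless while until begin do def class module case]

def matching_end(words, start)
  depth = 0
  (start...words.size).each do |i|
    if OPENERS.include?(words[i])
      depth += 1
    elsif words[i] == "end"
      depth -= 1
      return i if depth.zero?  # this 'end' closes the opener at `start`
    end
  end
  nil  # unbalanced input
end

# The nested-if example from below, flattened into words:
words = %w[if x > if x < 5 then 3 else 2 end then puts Good! end]
p matching_end(words, 0)  # => 15, the final 'end', not the inner one
```

The inner 'if'...'end' raises and lowers the depth in passing, so the scan correctly skips past it instead of stopping at the first 'end' it sees.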
Also note that just grabbing everything until the next 'then' would
not be good enough:
# Nonsense code, but still valid
if x > if x < 5 then 3 else 2 end then
  puts "Good!"
end
Don't worry about this type of thing. I have these problems well under
control, one way or another.
Does this sound like something that can be done without
too much trouble?
Definitely!
For doing code transformations it is of course also important that
you can turn back the stream of tokens into a String easily. I did
this with IRB's lexer by using the .line_no and .pos methods of
tokens, but that was not too good a match, actually.
So what would be a good match? I don't see why this should be a
problem. My implementation of Token implements to_s, which returns the
ruby code corresponding to the token; usually, this is exactly the
same as the code that created the token originally. There's also an
#offset method, which returns the position of the token in the input
stream, relative to the very beginning. Tokens don't have a #line_no,
but you can get the same information from FileAndLineTokens.
Turning the token stream back into a big string (or file) is essentially
what one of my test programs (tokentest) does. The resulting ruby files
are legal and parse in exactly the same way. I haven't yet shown that
they are really exactly equivalent (but there's not much room for
variation); that will be the next RubyLexer release.
If it weren't for that point then IRB's lexer would be a more or
less nifty match already.
I did this with IRB's lexer by using the .line_no and .pos
methods of tokens, but that was not too good a match, actually.
Wait... so you wrote irb's lexer? One of my wishlist items is to
integrate RubyLexer with irb among others.... how hard do you think
this will be?
Oh, that is still relatively simple. There's worse stuff happening
under the surface.
Well, it was unexpected for me. Much to my embarrassment: I thought I
was an expert at this. I must say many elements of this got me very
confused at first, and obviously I never put all the pieces together.
Congratulations.
PS: I haven't figured out why this breaks RubyLexer yet, but I will.
PPS: putting tricky stuff in eval strings and the like won't break the
lexer (yet). To the lexer, it's just a string.
It's basically something like the C preprocessor, but in a more
Rubyish manner written in obscure style. I guess it is pretty
useless after all.
Not at all. Now that I know what it does, maybe I'll find a use for it,
someday.