How to scan Java source texts?

Discussion in 'Java' started by Stefan Ram, Jun 11, 2013.

  1. Stefan Ram

    Stefan Ram Guest

    I'd like to scan Java source texts, printing one token per line.

    I thought it might be possible with the compiler API, and
    have read that it can return an AST, but I do not know how
    to just obtain the tokens from the source code AST.

    I am able to write a scanner for Java myself, but this would
    take days. So I would like to shortcut it by using a Java SE
    (with JDK) call. (I would not like to use a third-party
    library, because when I use the Java SE compiler API, I can
    be sure that this will be up to date with future Java versions.)

    So, the best solution would be a short program getting this
    information out of the Java compiler API. But I cannot find
    an example of this on the web.

    What does not seem to work is:

    public class Main
    {
      public static void main( final java.lang.String[] args )throws java.io.IOException
      {
        final java.io.File javaFile = new java.io.File( "Main.java" );
        final java.io.FileReader file = new java.io.FileReader( javaFile );
        final java.io.StreamTokenizer streamTokenizer = new java.io.StreamTokenizer( file );
        for( int i; true; )
        {
          i = streamTokenizer.nextToken();
          if( i == java.io.StreamTokenizer.TT_EOF )break;
          // sval is set only for word and quoted-string tokens, null otherwise
          java.lang.System.out.println( streamTokenizer.sval );
        }
      }
    }

    Still, this gives the idea of what I want to accomplish.

    For example, the scanner should decompose:

    a+=b +"c\"d/*e"/*f*/
    +g;

    into

    a
    +=
    b
    +
    "c\"d/*e"
    /*f*/
    +
    g
    ;

    (The comment »/*f*/« may just as well be dropped; also, there is
    no need for any further information, such as token types.)
     
    Stefan Ram, Jun 11, 2013
    #1

  2. Stefan Ram

    Stefan Ram Guest

    Stefan Ram writes:
    >I am able to write a scanner for Java myself, but this would
    >take days. So I would like to shortcut it by using a Java SE
    >(with JDK) call. (I would not like to use a third-party


    It might not be easy to get this right. For example, a
    well-known, popular source-code indenter formatted the
    several thousand lines of my Java project well, except for a
    single case, where the source text »a=4.436e+3« was split
    by a line break in the wrong place, giving something like

    a=4.436e
    +3
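
    The split above is wrong because, under the JLS rules for
    floating-point literals, the sign of the exponent belongs to the
    literal itself. A minimal sketch of that rule follows; the names
    FloatLiteralDemo and literalAt are made up for illustration, and
    the pattern is a simplified assumption (no hex floats, no
    underscores, no literals starting with a dot).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds a simplified decimal floating-point literal. */
final class FloatLiteralDemo {
    // digits, optional fraction, optional signed exponent, optional suffix
    private static final Pattern FLOAT =
            Pattern.compile("\\d+(?:\\.\\d*)?(?:[eE][+-]?\\d+)?[fFdD]?");

    /** Returns the literal starting exactly at index start, or null if none. */
    static String literalAt(String source, int start) {
        Matcher m = FLOAT.matcher(source);
        m.region(start, source.length());
        // lookingAt() anchors at the region start and takes the longest
        // run the greedy pattern allows, so "+3" stays inside the literal.
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(literalAt("a=4.436e+3", 2));
    }
}
```

    A tool that first cuts the text at every »+« and only then looks
    for numbers cannot get this case right; the literal has to be
    matched as a whole before operators are considered.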
     
    Stefan Ram, Jun 11, 2013
    #2

  3. Stefan Ram

    markspace Guest

    On 6/11/2013 11:54 AM, Stefan Ram wrote:
    > Stefan Ram writes:
    >> I am able to write a scanner for Java myself, but this would
    >> take days. So I would like to shortcut it by using a Java SE
    >> (with JDK) call. (I would not like to use a third-party

    >
    > It might not be easy to get this right. For example, a



    No, it's not. I recommend a third-party library. ANTLR already has a Java
    grammar worked out. There are also other dedicated Java parsers.

    Note that you're talking about two things here: lexing and parsing. A lexer
    breaks text up into tokens; a parser decides how to interpret the
    result. Parsers traditionally have a lot more contextual information,
    whereas lexers are just simpler state machines that break up text.
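
    To make the "lexer only" half concrete, here is a toy sketch that
    splits text into tokens with one regular expression and no parsing
    at all. ToyLexer and tokenize are made-up names, and the token
    grammar is a deliberately simplified assumption: it ignores Unicode
    escapes, hex and binary literals, literals starting with a dot, and
    text blocks, so it is not a substitute for a real Java lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy lexer covering a simplified subset of the JLS chapter 3 tokens. */
final class ToyLexer {
    // Alternatives are tried left to right, so comments and literals come
    // before operators, and longer operators before shorter ones.
    private static final Pattern TOKEN = Pattern.compile(String.join("|",
            "//[^\\n]*",                                   // end-of-line comment
            "/\\*[\\s\\S]*?\\*/",                          // traditional comment
            "\"(?:\\\\.|[^\"\\\\\\n])*\"",                 // string literal with escapes
            "'(?:\\\\.|[^'\\\\\\n])+'",                    // character literal
            "\\d[\\d_]*(?:\\.\\d+)?(?:[eE][+-]?\\d+)?[fFdDlL]?", // simplified number
            "[A-Za-z_$][A-Za-z0-9_$]*",                    // identifier or keyword
            ">>>=|<<=|>>=|>>>|\\.\\.\\.|->|::|\\+\\+|--|&&|\\|\\||[-+*/%&|^=!<>]=|<<|>>",
            "\\S"));                                       // any other single character

    /** Returns the tokens in order; whitespace between them is skipped. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example from the first post, one token per line.
        for (String token : tokenize("a+=b +\"c\\\"d/*e\"/*f*/\n+g;")) {
            System.out.println(token);
        }
    }
}
```

    Run on the example from the first post, this should print the nine
    lines listed there, with the comment »/*f*/« kept as its own line.
    The string literal is matched before the comment pattern gets a
    chance to see the »/*e« inside it, which is the ordering a
    hand-written state machine also has to respect.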
     
    markspace, Jun 11, 2013
    #3
  4. Stefan Ram

    Jeff Higgins Guest

    On 06/11/2013 12:26 PM, Stefan Ram wrote:
    > I'd like to scan Java source texts, printing one token per line.


    Do you mean these tokens:
    <http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.5>

    > I thought it might be possible with the compiler API, and
    > have read that it can return an AST, but I do not know how
    > to just obtain the tokens from the source code AST.


    An AST is built from the tokens above.

    [snip]
     
    Jeff Higgins, Jun 11, 2013
    #4
  5. Stefan Ram

    Stefan Ram Guest

    Jeff Higgins writes:
    >On 06/11/2013 12:26 PM, Stefan Ram wrote:
    >>I'd like to scan Java source texts, printing one token per line.

    >Do you mean these tokens:
    ><http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.5>


    Yes.

    >>I thought it might be possible with the compiler API, and
    >>have read that it can return an AST, but I do not know how
    >>to just obtain the tokens from the source code AST.

    >An AST is built from the tokens above.


    Yes. That's why the compiler still might have a copy of
    the tokens lying around somewhere or might have a method
    to get the next token. I just can't find such a method.
     
    Stefan Ram, Jun 11, 2013
    #5
  6. Stefan Ram

    Jeff Higgins Guest

    On 06/11/2013 05:02 PM, Stefan Ram wrote:
    > Jeff Higgins writes:
    >> On 06/11/2013 12:26 PM, Stefan Ram wrote:
    >>> I'd like to scan Java source texts, printing one token per line.

    >> Do you mean these tokens:
    >> <http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.5>

    >
    > Yes.
    >
    >>> I thought it might be possible with the compiler API, and
    >>> have read that it can return an AST, but I do not know how
    >>> to just obtain the tokens from the source code AST.

    >> An AST is built from the tokens above.

    >
    > Yes. That's why the compiler still might have a copy of
    > the tokens lying around somewhere or might have a method
    > to get the next token. I just can't find such a method.
    >

    I suspect, but don't know, that these tokens may have lost some
    of the information associated with their being 'InputElements'
    by the time the AST is constructed. It shouldn't be too hard
    to find a Java lexer that will output as you request.
    I'll look around when I have a little more time.
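
    For comparison, this is roughly what the supported route gives
    you. The sketch below uses javax.tools and the com.sun.source tree
    API to parse a source string and list the kinds of the AST nodes.
    AstDemo, StringSource, nodeKinds and the file name "Demo.java" are
    made up for illustration, and it assumes a full JDK (on a bare JRE,
    ToolProvider.getSystemJavaCompiler() returns null).

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class AstDemo {
    /** In-memory source file, so that no file on disk is needed. */
    private static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String fileName, String code) {
            super(URI.create("string:///" + fileName), JavaFileObject.Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses the code (no attribution) and returns the kind of every AST node. */
    static List<String> nodeKinds(String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // needs a JDK
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                Collections.singletonList(new StringSource("Demo.java", code)));
        final List<String> kinds = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void scan(Tree node, Void unused) {
                        if (node != null) {
                            kinds.add(node.getKind().name());
                        }
                        return super.scan(node, unused);
                    }
                }.scan(unit, null);
            }
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return kinds;
    }

    public static void main(String[] args) {
        for (String kind : nodeKinds("class Demo { int a = 4; }")) {
            System.out.println(kind);
        }
    }
}
```

    The result is a list of tree nodes such as a class, a variable and
    a literal. Punctuation like »{«, »=« and »;« has no node of its
    own, which fits the suspicion above that the token-level
    information is no longer available at the AST level.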
     
    Jeff Higgins, Jun 12, 2013
    #6
  7. Stefan Ram

    Jeff Higgins Guest

    On 06/11/2013 10:07 PM, Jeff Higgins wrote:
    > On 06/11/2013 05:02 PM, Stefan Ram wrote:
    >> Jeff Higgins writes:
    >>> On 06/11/2013 12:26 PM, Stefan Ram wrote:
    >>>> I'd like to scan Java source texts, printing one token per line.
    >>> Do you mean these tokens:
    >>> <http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.5>

    >>
    >> Yes.
    >>
    >>>> I thought it might be possible with the compiler API, and
    >>>> have read that it can return an AST, but I do not know how
    >>>> to just obtain the tokens from the source code AST.
    >>> An AST is built from the tokens above.

    >>
    >> Yes. That's why the compiler still might have a copy of
    >> the tokens lying around somewhere or might have a method
    >> to get the next token. I just can't find such a method.
    >>

    > I suspect, but don't know, that these tokens may have lost some
    > of the information associated with their being 'InputElements'
    > by the time the AST is constructed. It shouldn't be too hard
    > to find a Java lexer that will output as you request.
    > I'll look around when I have a little more time.
    >


    From OpenJDK:

    package com.sun.tools.javac.parser;

    /** The lexical analyzer maps an input stream consisting of
    * ASCII characters and Unicode escapes into a token sequence.
    *
    * <p><b>This is NOT part of any supported API.
    * If you write code that depends on this, you do so at your own risk.
    * This code and its internal interfaces are subject to change or
    * deletion without notice.</b>
    */
    public class Scanner implements Lexer {
     
    Jeff Higgins, Jun 12, 2013
    #7
  8. Stefan Ram

    Jeff Higgins Guest

    On 06/12/2013 04:04 AM, Jeff Higgins wrote:
    >
    > From OpenJDK:
    >

    <http://openjdk.java.net/groups/compiler>
     
    Jeff Higgins, Jun 12, 2013
    #8
  9. Stefan Ram

    Jeff Higgins Guest

    On 06/12/2013 04:15 AM, Jeff Higgins wrote:
    > On 06/12/2013 04:04 AM, Jeff Higgins wrote:
    >>
    >> From OpenJDK:
    >>

    > <http://openjdk.java.net/groups/compiler>

    It turns out to be surprisingly easy to build javac.
    It shouldn't be too hard to add a command-line switch -tokens
    and the requisite code to output the tokens, one per line, as
    they appear, using the Lexer interface.
    I don't see a way to do what you want using the existing API.
     
    Jeff Higgins, Jun 12, 2013
    #9
  10. Stefan Ram

    Stefan Ram Guest

    Jeff Higgins writes:
    >I don't see a way to do what you want using the existing API.


    Thanks for your remarks, which helped me
    to find out that it can be done, once one
    is willing to use the »com.sun....«-classes,
    such as »Scanner«. »tools.jar« needs to be on
    the classpath for this.

    Now, there is indeed a risk that these classes
    will change in future JDK versions. Still,
    I expect them to be more stable than some
    third-party libraries. For example, I previously
    used a third-party program for the same purpose,
    but it has not been adapted to Java >= 1.5, so
    I needed to find another way to accomplish this
    for Java >= 1.5.
     
    Stefan Ram, Jun 12, 2013
    #10
  11. Stefan Ram

    Jeff Higgins Guest

    On 06/12/2013 12:44 PM, Stefan Ram wrote:
    > Jeff Higgins writes:
    >> I don't see a way to do what you want using the existing API.

    >
    > Thanks for your remarks, which helped me
    > to find out that it can be done, once one
    > is willing to use the »com.sun....«-classes,
    > such as »Scanner«. »tools.jar« needs to be on
    > the classpath for this.
    >
    > Now, there is indeed a risk that these classes
    > will change in future JDK versions. Still,
    > I expect them to be more stable than some
    > third-party libraries. For example, I previously
    > used a third-party program for the same purpose,
    > but it has not been adapted to Java >= 1.5, so
    > I needed to find another way to accomplish this
    > for Java >= 1.5.
    >

    Well, right there under my nose! :-O :))
    It was fun building my own compiler though!
    Maybe a new language to go with it Jeffa!
     
    Jeff Higgins, Jun 12, 2013
    #11
  12. Stefan Ram

    Roedy Green Guest

    On 11 Jun 2013 16:26:02 GMT, Stefan Ram
    wrote, quoted or indirectly quoted someone who said:

    > I'd like to scan Java source texts, printing one token per line.


    You mean Java source code? I wrote a finite state machine parser for
    Java Snippets (i.e. incomplete Java and Java with syntax errors) with
    the intention of classifying each token and printing it out in a
    special colour and font.

    The source is available at <http://mindprod.com/products1.html#JDISPLAY>.
    The class of most interest would be com.mindprod.jprep.JavaState.

    You could use it exactly as is. It creates binary token files.
    All you would need to do is write a reader for the token file, and
    display each token one per line ignoring most of the information
    encoded in the token type.

    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Getting information off the Internet is
    like taking a drink from a fire hydrant.
    ~ Mitch Kapor 1950-11-01
     
    Roedy Green, Jun 12, 2013
    #12
  13. Stefan Ram

    Jeff Higgins Guest

    On 06/12/2013 12:44 PM, Stefan Ram wrote:
    > Jeff Higgins writes:
    >> I don't see a way to do what you want using the existing API.

    >
    > Thanks for your remarks, which helped me
    > to find out that it can be done, once one
    > is willing to use the »com.sun....«-classes,
    > such as »Scanner«. »tools.jar« needs to be on
    > the classpath for this.


    The only problem I see now is gaining access to the protected
    Scanner constructor outside of the com.sun.tools.javac.parser package.

    Maybe I'll try extending my newly built javac
    to include a -tokenize extension as above.


    > Now, there is indeed a risk that these classes
    > will change in future JDK versions. Still,
    > I expect them to be more stable than some
    > third-party libraries. For example, I previously
    > used a third-party program for the same purpose,
    > but it has not been adapted to Java >= 1.5, so
    > I needed to find another way to accomplish this
    > for Java >= 1.5.
    >
     
    Jeff Higgins, Jun 12, 2013
    #13
  14. Stefan Ram

    Stefan Ram Guest

    Jeff Higgins writes:
    >The only problem I see now is gaining access to the protected
    >Scanner constructor outside of the com.sun.tools.javac.parser package.


    You need to use a factory method of a ScannerFactory, and to
    get that, you need to use yet another factory-like method of
    ScannerFactory, which needs a Context, but this time you can
    use Context's default constructor.
     
    Stefan Ram, Jun 12, 2013
    #14
  15. Stefan Ram

    Jeff Higgins Guest

    On 06/12/2013 02:45 PM, Stefan Ram wrote:
    > Jeff Higgins writes:
    >> The only problem I see now is gaining access to the protected
    >> Scanner constructor outside of the com.sun.tools.javac.parser package.

    >
    > You need to use a factory method of a ScannerFactory, and to
    > get that, you need to use yet another factory-like method of
    > ScannerFactory, which needs a Context, but this time you can
    > use Context's default constructor.
    >

    ScannerFactory.instance(new Context()).newScanner(args[0], true);
    Wonderful! Thanks.
     
    Jeff Higgins, Jun 12, 2013
    #15
