Question on Java Language Specification section 3.3 (Unicode escape lexing)

Discussion in 'Java' started by Aryeh M. Friedman, Jan 2, 2013.

  1. If I am writing a lexer for Java in a 100% unicode environment (it already uses unicode for all internal representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code will be imported from non-unicode environments
     
    Aryeh M. Friedman, Jan 2, 2013
    #1

  2. On Wednesday, January 2, 2013 3:20:12 AM UTC-5, Aryeh M. Friedman wrote:
    > If I am writing a lexer for Java in a 100% unicode environment (it already uses unicode for all internal representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code will be imported from non-unicode environments


    Just a follow-up: this is for a Java-to-native (x86) compiler, written in Java, that I am doing for fun (no practical purpose except practice in compiler writing; it is not for school or work)
     
    Aryeh M. Friedman, Jan 2, 2013
    #2

  3. Aryeh M. Friedman

    Lew Guest

    On Wednesday, January 2, 2013 12:20:12 AM UTC-8, Aryeh M. Friedman wrote:
    > If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
    > representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
    > deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
    > will be imported from non-unicode environments


    What do you mean "have to deal with"?

    If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
    authority on what that constitutes.

    Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
    any responsibilities.

    Nor does it obviate the need for the occasional "\uXXXX" in source.

    However, I don't think the lexer deals with that. Unicode escape sequences are a precompile phenomenon. Everything is substituted before parsing starts.

    --
    Lew
     
    Lew, Jan 2, 2013
    #3
  4. Aryeh M. Friedman

    Roedy Green Guest

    On Wed, 2 Jan 2013 00:20:12 -0800 (PST), "Aryeh M. Friedman"
    <> wrote, quoted or indirectly quoted someone
    who said :

    > (\uXXXX)


    The only places you encounter such escapes are in Java source and
    possibly resource bundles.

    Other types of escape you run into are HTML entities such as &eacute;
    or numeric character references such as &#123;
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Students who hire or con others to do their homework are as foolish
    as couch potatoes who hire others to go to the gym for them.
     
    Roedy Green, Jan 2, 2013
    #4
  5. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 3:20 AM, Aryeh M. Friedman wrote:
    > If I am writing a lexer for Java in a 100% unicode environment (it already
    > uses unicode for all internal representation of text) and 100% of the code
    > that I will be lexing is from that environment, do I still need to deal
    > with unicode escapes (\uXXXX) in real life [vs. theoretically complete
    > lexing]... assume that no code will be imported from non-unicode
    > environments


    It will not be a Java lexer if it does not understand
    that.

    And is it that much effort to implement that you would
    rather create an AMF lexer instead?

    I suspect that it is easy to implement.

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #5
  6. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 2:16 PM, Lew wrote:
    > On Wednesday, January 2, 2013 12:20:12 AM UTC-8, Aryeh M. Friedman wrote:
    >> If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
    >> representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
    >> deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
    >> will be imported from non-unicode environments

    >
    > What do you mean "have to deal with"?
    >
    > If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
    > authority on what that constitutes.
    >
    > Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
    > any responsibilities.
    >
    > Nor does it obviate the need for the occasional "\uXXXX" in source.
    >
    > However, I don't think the lexer deals with that. Unicode escape sequences are a precompile phenomenon. Everything is substituted before parsing starts.


    Well - lexing happens before parsing so ...

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #6
  7. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 2:17 PM, Roedy Green wrote:
    > On Wed, 2 Jan 2013 00:20:12 -0800 (PST), "Aryeh M. Friedman"
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> (\uXXXX)

    >
    > The only places you encounter such escapes are in Java source and
    > possibly resource bundles.


    Well - since he is writing a lexer for Java then ...

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #7
  8. Aryeh M. Friedman

    Lew Guest

    Arne Vajhøj wrote:
    > Lew wrote:
    >>Aryeh M. Friedman wrote:
    >>> If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
    >>> representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
    >>> deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
    >>> will be imported from non-unicode environments

    >
    >> What do you mean "have to deal with"?
    >>
    >> If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
    >> authority on what that constitutes.

    >
    >> Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
    >> any responsibilities.

    >
    >> Nor does it obviate the need for the occasional "\uXXXX" in source.

    >
    >> However, I don't think the lexer deals with that. Unicode escape sequences are a precompile
    >> phenomenon. Everything is substituted before parsing starts.

    >
    > Well - lexing happens before parsing so ...


    So does writing source code. What's your point?

    My point is that the lexer picks up after the substitution of Unicode sequences.
    However, my point is wrong, and yours is right.

    http://www.docjar.com/html/api/com/sun/tools/javac/parser/Lexer.java.html

    --
    Lew
     
    Lew, Jan 3, 2013
    #8

  9. >
    > Well - since he is writing a lexer for Java then ...


    A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically, we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):

    1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})

    2. Do some minor style enforcement like warning (have not decided if it should reject or just warn) if a class/method does not have something that at least looks like a javadoc header comment (/** ... */ is sufficient for this purpose)

    Note:

    A long-term personal project of mine is to write an OS completely from the ground up in a superset of Java (the only addition I see that is needed is some type of "safe" pointer type)... in this case safe being defined as you can assign a literal address to it but you're not allowed to do ptr math on it
     
    Aryeh M. Friedman, Jan 3, 2013
    #9
  10. On Wednesday, January 2, 2013 8:27:21 PM UTC-5, Aryeh M. Friedman wrote:
    > > Well - since he is writing a lexer for Java then ...
    >
    > A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically, we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):
    >
    > 1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})
    >
    > 2. Do some minor style enforcement like warning (have not decided if it should reject or just warn) if a class/method does not have something that at least looks like a javadoc header comment (/** ... */ is sufficient for this purpose)
    >
    > Note:
    >
    > A long-term personal project of mine is to write an OS completely from the ground up in a superset of Java (the only addition I see that is needed is some type of "safe" pointer type)... in this case safe being defined as you can assign a literal address to it but you're not allowed to do ptr math on it


    In case anyone is interested I have some personal notes on the project at http://dt.fnwe.net/a-javacNative/
     
    Aryeh M. Friedman, Jan 3, 2013
    #10
  11. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 8:21 PM, Lew wrote:
    > Arne Vajhøj wrote:
    >> Lew wrote:
    >>> Aryeh M. Friedman wrote:
    >>>> If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
    >>>> representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
    >>>> deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
    >>>> will be imported from non-unicode environments

    >>
    >>> What do you mean "have to deal with"?
    >>>
    >>> If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
    >>> authority on what that constitutes.

    >>
    >>> Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
    >>> any responsibilities.

    >>
    >>> Nor does it obviate the need for the occasional "\uXXXX" in source.

    >>
    >>> However, I don't think the lexer deals with that. Unicode escape sequences are a precompile
    >>> phenomenon. Everything is substituted before parsing starts.

    >>
    >> Well - lexing happens before parsing so ...

    >
    > So does writing source code. What's your point?


    That it being done before parsing does not imply that it is not done by the lexer.

    > My point is that the lexer picks up after the substitution of Unicode sequences.
    > However, my point is wrong, and yours is right.
    >
    > http://www.docjar.com/html/api/com/sun/tools/javac/parser/Lexer.java.html


    I am not quite sure what that source code snippet shows.

    But a lexer is something that converts from a stream of
    source code to a stream of tokens.

    Given that:
    - the source code contains the escape sequences
    - escape sequences get treated similarly to real Unicode characters
    and if we assume that:
    - the parser has not duplicated a ton of logic to handle
    a unicode token
    then the conversion of escape sequences must happen in
    the lexer.

    Whether it is a filter in front of the real lexer or more
    deeply buried into the lexer is not as easy to say.
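
    A minimal sketch of the "filter in front of the real lexer" option, for illustration only. It follows the rules of JLS 3.3 as I read them: a backslash starts an escape only when an even number of raw backslashes precede it, one or more 'u' characters may follow, exactly four hex digits are required, and the translated character is never rescanned. The names UnicodeEscapes, translate and hexValue are hypothetical, and this is not how javac structures its reader.

```java
/**
 * Sketch of a pre-lexing filter for Java Unicode escapes (JLS section 3.3).
 * Comments here say "backslash-u" on purpose: a raw backslash followed by 'u'
 * is translated even inside comments, so spelling it out could break the build.
 */
final class UnicodeEscapes {
    private UnicodeEscapes() {}

    /** Returns src with every eligible backslash-u escape replaced by its UTF-16 code unit. */
    static String translate(CharSequence src) {
        StringBuilder out = new StringBuilder(src.length());
        int precedingBackslashes = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            // Assumption: eligibility is decided on the raw input, not on translated output.
            boolean eligible = c == '\\' && precedingBackslashes % 2 == 0;
            if (eligible && i + 1 < src.length() && src.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // one or more 'u' characters are allowed
                }
                if (j + 4 > src.length()) {
                    throw new IllegalArgumentException("truncated Unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    value = value * 16 + hexValue(src.charAt(k), i);
                }
                // The result is appended as-is and never rescanned, so an escaped
                // backslash cannot start another escape.
                out.append((char) value);
                precedingBackslashes = 0;
                i = j + 4;
            } else {
                out.append(c);
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    /** Accepts only ASCII hex digits; anything else is reported as a malformed escape. */
    private static int hexValue(char c, int escapeStart) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("malformed Unicode escape at index " + escapeStart);
    }
}
```

    The token-level lexer would then read from the translated text, so it never sees an escape. A lexer that needs exact source positions for error messages would have to keep a mapping back to the raw input, which this sketch leaves out.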

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #11
  12. Aryeh M. Friedman

    Lew Guest

    On Wednesday, January 2, 2013 5:27:21 PM UTC-8, Aryeh M. Friedman wrote:
    > > Well - since he is writing a lexer for Java then ...
    >
    > A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically, we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):


    And your team can't use a real Java compiler because ... ?

    > 1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})


    There are standard build systems that handle this. What do you use?

    > 2. Do some minor style enforcement like warning (have not decided if it should reject or just warn) if a class/method does not have something that at least looks like a javadoc header comment (/** ... */ is sufficient for this purpose)


    Lint. Findbugs. And more. Why reinvent the wheel?

    > Note:
    > A long-term personal project of mine is to write an OS completely from the ground up in a superset of Java (the only addition I see that is needed is some type of "safe" pointer type)... in this case safe being defined as you can assign a literal address to it but you're not allowed to do ptr math on it


    Also known as "a JVM"?

    --
    Lew
     
    Lew, Jan 3, 2013
    #12
  13. On Wednesday, January 2, 2013 8:42:57 PM UTC-5, Lew wrote:
    > On Wednesday, January 2, 2013 5:27:21 PM UTC-8, Aryeh M. Friedman wrote:
    > > > Well - since he is writing a lexer for Java then ...
    > >
    > > A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically, we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):
    >
    > And your team can't use a real Java compiler because ... ?
    >
    > > 1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})
    >
    > There are standard build systems that handle this. What do you use?


    Aegis/cook (aegis.sf.net), and the issue is not the build system per se but how most build systems deal with the split source view aegis provides (which the aegis documentation correctly points out *ALMOST* no build system deals with correctly except for cook)... i.e. every build system assumes that there is a single definition of each compilation unit, whereas aegis makes it so the current version and all previous versions are available (this makes tracking what changes have been made much easier than with, say, cvs/svn/git; besides, none of them are truly atomic in terms of commits {you have to prove your code works before committing it, and this is enforced by the VC system}).

    > > 2. Do some minor style enforcement like warning (have not decided if it should reject or just warn) if a class/method does not have something that at least looks like a javadoc header comment (/** ... */ is sufficient for this purpose)
    >
    > Lint. Findbugs. And more. Why reinvent the wheel?


    Because we have other (proprietary) style rules that those tools do not enforce, which makes them unusable for us

    > > Note:
    > > A long-term personal project of mine is to write an OS completely from the ground up in a superset of Java (the only addition I see that is needed is some type of "safe" pointer type)... in this case safe being defined as you can assign a literal address to it but you're not allowed to do ptr math on it
    >
    > Also known as "a JVM"?


    As far as I know, the JVM cannot be directly booted (as in, if I turn on my PC it cannot boot into the JVM)... nor, for performance reasons, does it make sense to run a VM at the bottom layer... another reason is that there is a lot of junk in the JRE (like, how do you do garbage collection if you do not have some way of the OS allocating mem to a process in the first place)
     
    Aryeh M. Friedman, Jan 3, 2013
    #13
  14. On Wednesday, January 2, 2013 8:55:19 PM UTC-5, Aryeh M. Friedman wrote:
    > On Wednesday, January 2, 2013 8:42:57 PM UTC-5, Lew wrote:
    > > On Wednesday, January 2, 2013 5:27:21 PM UTC-8, Aryeh M. Friedman wrote:
    > > > > Well - since he is writing a lexer for Java then ...
    > > >
    > > > A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically, we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):
    > >
    > > And your team can't use a real Java compiler because ... ?


    Who said we aren't? I only said the lexing/parsing components would be used by the team; the actual code generation is purely for fun.

    > Aegis/cook (aegis.sf.net), and the issue is not the build system per se but how most build systems deal with the split source view aegis provides (which the aegis documentation correctly points out *ALMOST* no build system deals with correctly except for cook)... i.e. every build system assumes that there is a single definition of each compilation unit, whereas aegis makes it so the current version and all previous versions are available (this makes tracking what changes have been made much easier than with, say, cvs/svn/git; besides, none of them are truly atomic in terms of commits {you have to prove your code works before committing it, and this is enforced by the VC system}).


    Also, cook is the only build system (besides make) that I have dealt with that does not make extremely hard and often incorrect assumptions about your environment. For example, ant/maven/etc. are a nightmare for anything besides Java (we often have 2 or 3 langs in one project), and very often they assume you're using an IDE (and often a specific one), whereas we do all our work from the command line by company policy (we have never found an IDE that correctly handles stuff like making completely standalone executable jars and such)
     
    Aryeh M. Friedman, Jan 3, 2013
    #14
  15. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 9:02 PM, Aryeh M. Friedman wrote:
    > On Wednesday, January 2, 2013 8:55:19 PM UTC-5, Aryeh M. Friedman
    >> Aegis/cook (aegis.sf.net), and the issue is not the build system per
    >> se but how most build systems deal with the split source view aegis
    >> provides (which the aegis documentation correctly points out
    >> *ALMOST* no build system deals with correctly except for cook)...
    >> i.e. every build system assumes that there is a single
    >> definition of each compilation unit, whereas aegis makes it so the
    >> current version and all previous versions are available (this makes
    >> tracking what changes have been made much easier than with, say,
    >> cvs/svn/git; besides, none of them are truly atomic in terms of
    >> commits {you have to prove your code works before committing it, and
    >> this is enforced by the VC system}).

    >
    > Also, cook is the only build system (besides make) that I have dealt
    > with that does not make extremely hard and often incorrect assumptions
    > about your environment. For example, ant/maven/etc. are a nightmare for
    > anything besides Java (we often have 2 or 3 langs in one project)


    Ant and maven are designed for Java. They work fine for Java and
    Java compatible languages.

    They may not work great for completely unrelated languages.

    But they can activate other build systems and most other build
    systems can activate them.

    > and
    > very often they assume you're using an IDE (and often a specific one)


    Ant and Maven do not make such an assumption.

    > whereas we do all our work from the command line by company policy (we
    > have never found an IDE that correctly handles stuff like making
    > completely standalone executable jars and such)


    All Java IDE's that I know can do that.

    So can ant and maven.

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #15
  16. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 8:55 PM, Aryeh M. Friedman wrote:
    > On Wednesday, January 2, 2013 8:42:57 PM UTC-5, Lew wrote:
    >> On Wednesday, January 2, 2013 5:27:21 PM UTC-8, Aryeh M. Friedman
    >>> A long-term personal project of mine is to write an OS completely
    >>> from the ground up in a superset of Java (the only addition I
    >>> see that is needed is some type of "safe" pointer type)... in
    >>> this case safe being defined as you can assign a literal address
    >>> to it but you're not allowed to do ptr math on it

    >>
    >> Also known as "a JVM"?

    >
    > As far as I know, the JVM cannot be directly booted (as in, if I turn on
    > my PC it cannot boot into the JVM)...


    The JVM was not created for writing OS'es but for writing
    applications so that is correct.

    > nor, for performance
    > reasons, does it make sense to run a VM at the bottom layer...


    Performance should not be a problem.

    > another
    > reason is that there is a lot of junk in the JRE (like, how do you do
    > garbage collection if you do not have some way of the OS allocating
    > mem to a process in the first place)


    Again. Java was designed to write applications not OS'es.

    The "junk" you are talking about is what makes it useful
    for the big majority.

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #16

  17. > All Java IDE's that I know can do that.


    Let's see: we have tried Eclipse, NetBeans, BlueJ, DrJava and a few others, and every single one failed to produce jars that can be run without post-build changes to the manifest and/or without libs that came with the IDE

    > So can ant and maven.


    Ant and Maven can, agreed.
     
    Aryeh M. Friedman, Jan 3, 2013
    #17
  18. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 8:27 PM, Aryeh M. Friedman wrote:
    > the ground up in a superset of Java (the only addition I see that is
    > needed is some type of "safe" pointer type)... in this case safe
    > being defined as you can assign a literal address to it but you're not
    > allowed to do ptr math on it


    See http://en.wikipedia.org/wiki/Singularity_(operating_system)
    for an example of OS in managed language.

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #18

  19. > The JVM was not created for writing OS'es but for writing
    >
    > applications so that is correct.


    In my professional life that's how I use Java; the comment only pertained to the motivation for writing a native compiler (which is for fun)
     
    Aryeh M. Friedman, Jan 3, 2013
    #19
  20. Aryeh M. Friedman

    Arne Vajhøj Guest

    On 1/2/2013 9:16 PM, Aryeh M. Friedman wrote:
    >> All Java IDE's that I know can do that.

    >
    > Let's see: we have tried Eclipse, NetBeans, BlueJ, DrJava and a few
    > others, and every single one failed to produce jars that can be run
    > without post-build changes to the manifest and/or without libs that
    > came with the IDE


    It can be done.

    Obviously it can also be made not to work.

    Maybe you should master a Java IDE before writing an OS in Java.

    :)

    Arne
     
    Arne Vajhøj, Jan 3, 2013
    #20
