question on java lang spec chapter 3.3 (unicode char lexing)

Aryeh M. Friedman · Jan 2, 2013

If I am writing a lexer for Java in a 100% unicode environment (it already uses unicode for all internal representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code will be imported from non-unicode environments

Just a follow-up: this is for a Java-to-native (x86) compiler, written in Java, that I am doing for fun (no practical purpose except for practice in compiler writing [not for school or work])

Lew · Jan 2, 2013

If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
will be imported from non-unicode environments

What do you mean "have to deal with"?

If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
authority on what that constitutes.

Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
any responsibilities.

Nor does it obviate the need for the occasional "\uXXXX" in source.

However, I don't think the lexer deals with that. Unicode escape sequences are a precompile phenomenon. Everything is substituted before parsing starts.

Roedy Green · Jan 2, 2013

(\uXXXX)

The only places you encounter such escapes are in Java source and
possibly resource bundles.

Other types of escape you run into are HTML entities like &eacute; (é)
or &#123; ({)

Arne Vajhøj · Jan 3, 2013

If I am writing a lexer for Java in a 100% unicode environment (it already uses
unicode for all internal representation of text) and 100% of the code
that I will be lexing is from that environment, do I still need to deal
with unicode escapes (\uXXXX) in real life [vs. theoretically complete
lexing]... assume that no code will be imported from non-unicode
environments

It will not be a Java lexer if it does not understand
that.

And is it that much effort to implement that you would
rather create an AMF lexer instead?

I suspect that it is easy to implement.

Arne

Arne Vajhøj · Jan 3, 2013

If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
will be imported from non-unicode environments

What do you mean "have to deal with"?

If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
authority on what that constitutes.

Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
any responsibilities.

Nor does it obviate the need for the occasional "\uXXXX" in source.

However, I don't think the lexer deals with that. Unicode escape sequences are a precompile phenomenon. Everything is substituted before parsing starts.

Well - lexing happens before parsing so ...

Arne

Arne Vajhøj · Jan 3, 2013

The only places you encounter such escapes are in Java source and
possibly resource bundles.

Well - since he is writing a lexer for Java then ...

Arne

Lew · Jan 3, 2013

Arne said:
Lew said:

Aryeh said:

If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
will be imported from non-unicode environments

What do you mean "have to deal with"?

If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
authority on what that constitutes.

Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
any responsibilities.

Nor does it obviate the need for the occasional "\uXXXX" in source.

However, I don't think the lexer deals with that. Unicode escape sequences are a precompile
phenomenon. Everything is substituted before parsing starts.

Well - lexing happens before parsing so ...

So does writing source code. What's your point?

My point is that the lexer picks up after the substitution of Unicode sequences.
However, my point is wrong, and yours is right.

http://www.docjar.com/html/api/com/sun/tools/javac/parser/Lexer.java.html

Aryeh M. Friedman · Jan 3, 2013

Well - since he is writing a lexer for Java then ...

A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):

1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})

2. Do some minor style enforcement like warning (have not decided if it should reject or just warn) if a class/method does not have something that at least looks like a javadoc header comment (/** ... */ is sufficient for this purpose)
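
The style check in item 2 could be prototyped before the real lexer exists. The following is only a rough sketch under stated assumptions: it works on raw text with a regular expression instead of tokens, the names `JavadocHeaderCheck` and `typesMissingJavadoc` are made up for this example, and it only looks at class/interface/enum declarations (methods are left out).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JavadocHeaderCheck {
    private JavadocHeaderCheck() {}

    // Heuristic (an assumption of this sketch): a type declaration is a line that
    // starts with optional modifiers followed by class, interface or enum.
    private static final Pattern TYPE_DECL = Pattern.compile(
            "(?m)^[ \\t]*(?:(?:public|protected|private|abstract|final|static)\\s+)*"
            + "(?:class|interface|enum)\\s+(\\w+)");

    /** Returns the names of declared types that are not directly preceded by a doc comment. */
    static List<String> typesMissingJavadoc(String source) {
        List<String> missing = new ArrayList<>();
        Matcher m = TYPE_DECL.matcher(source);
        while (m.find()) {
            if (!endsWithDocComment(source, m.start())) {
                missing.add(m.group(1));
            }
        }
        return missing;
    }

    // True if the text before 'end', ignoring trailing whitespace, ends with
    // something that looks like a doc comment (slash-star-star ... star-slash).
    private static boolean endsWithDocComment(String source, int end) {
        while (end > 0 && Character.isWhitespace(source.charAt(end - 1))) {
            end--;
        }
        if (end < 2 || !source.startsWith("*/", end - 2)) {
            return false;
        }
        int open = source.lastIndexOf("/*", end - 2);
        // The opener must be "/**" and must not overlap the closing "*/".
        return open >= 0 && source.startsWith("/**", open) && open + 3 <= end - 2;
    }
}
```

Known limits of this sketch: an annotation between the comment and the declaration is reported as a missing header, and a line inside a comment or string that happens to look like a declaration is matched too. A token-based version built on the lexer would avoid both.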

Note:

A long-term personal project of mine is to write an OS completely from the ground up in a superset of Java (the only addition I see that is needed is some type of "safe" pointer type)... in this case safe being defined as you can assign a literal address to it but you're not allowed to do ptr math on it

In case anyone is interested I have some personal notes on the project at http://dt.fnwe.net/a-javacNative/

Arne Vajhøj · Jan 3, 2013

Arne said:
Arne said:

Lew said:

Aryeh M. Friedman wrote:
If I am writing a lexer for Java in a 100% unicode [sic] environment (it already uses unicode for all internal
representation of text) and 100% of the code that I will be lexing is from that environment, do I still need to
deal with unicode escapes (\uXXXX) in real life [vs. theoretically complete lexing]... assume that no code
will be imported from non-unicode environments

What do you mean "have to deal with"?

If you mean to parse Java source, you have to be able to parse Java source. The JLS is the final
authority on what that constitutes.

Being "in a 100% unicode [sic] environment" (whatever that's supposed to mean) does not excuse
any responsibilities.

Nor does it obviate the need for the occasional "\uXXXX" in source.

However, I don't think the lexer deals with that. Unicode escape sequences are a precompile
phenomenon. Everything is substituted before parsing starts.

Well - lexing happens before parsing so ...

So does writing source code. What's your point?

That being done before parsing does not imply it is not done by the lexer.

My point is that the lexer picks up after the substitution of Unicode sequences.
However, my point is wrong, and yours is right.

http://www.docjar.com/html/api/com/sun/tools/javac/parser/Lexer.java.html

I am not quite sure what that source code snippet shows.

But a lexer is something that converts from a stream of
source code to a stream of tokens.

Given that:
- the source code contains the escape sequences
- escape sequences get treated similarly to real Unicode characters
and if we assume that:
- the parser has not duplicated a ton of logic to handle
a unicode token
then the conversion of escape sequences must happen in
the lexer.

Whether it is a filter in front of the real lexer or more
deeply buried into the lexer is not as easy to say.
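
As a rough illustration of the "filter in front of the real lexer" variant, here is a minimal sketch that assumes the whole compilation unit is already held in a String. The names `UnicodeEscapeFilter` and `translate` are made up for this example and are not javac's. The rules follow my reading of JLS 3.3: a backslash starts an escape only if it is preceded by an even number of contiguous backslashes, it may be followed by one or more `u` characters, and then exactly four hex digits must follow.

```java
final class UnicodeEscapeFilter {
    private UnicodeEscapeFilter() {}

    /** Returns src with every eligible backslash-u escape replaced by the UTF-16 code unit it names. */
    static String translate(String src) {
        int n = src.length();
        StringBuilder out = new StringBuilder(n);
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
            } else if (i + 1 < n && src.charAt(i + 1) == 'u') {
                // Eligible backslash followed by one or more 'u' characters.
                int j = i + 1;
                while (j < n && src.charAt(j) == 'u') {
                    j++;
                }
                if (j + 4 > n) {
                    throw new IllegalArgumentException("truncated unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    value = value * 16 + hexValue(src.charAt(k), i);
                }
                out.append((char) value);
                i = j + 4;
            } else {
                // Not an escape. Copy the backslash; if another backslash follows,
                // copy it as well, because it sits at an odd position in the run
                // and therefore cannot start an escape.
                out.append(c);
                i++;
                if (i < n && src.charAt(i) == '\\') {
                    out.append('\\');
                    i++;
                }
            }
        }
        return out.toString();
    }

    // Only ASCII hex digits are accepted; anything else is reported as an error.
    private static int hexValue(char c, int escapeStart) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("bad hex digit in unicode escape at index " + escapeStart);
    }
}
```

With this in front, the raw text `ch\u0061r` reaches the tokenizer as `char`, while `\\u0041` (two backslashes) passes through unchanged. A real lexer would more likely do this on the fly while reading characters and keep the original offsets for error messages; the sketch ignores both.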

Arne

Lew · Jan 3, 2013

Well - since he is writing a lexer for Java then ...

A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):

And your team can't use a real Java compiler because ... ?

1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})

There are standard build systems that handle this. What do you use?

2. Do some minor style enforcement like warning (have not decided if it should reject or just warn) if a class/method does not have something that at least looks like a javadoc header comment (/** ... */ is sufficient for this purpose)

Lint. Findbugs. And more. Why reinvent the wheel?

Note:
A long-term personal project of mine is to write an OS completely from the ground up in a superset of Java (the only addition I see that is needed is some type of "safe" pointer type)... in this case safe being defined as you can assign a literal address to it but you're not allowed to do ptr math on it

Also known as "a JVM"?

Aryeh M. Friedman · Jan 3, 2013

Well - since he is writing a lexer for Java then ...

A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):

And your team can't use a real Java compiler because ... ?

1. Scan for a complete list of classes referenced by a given class (our build system sometimes hiccups on not realizing that when class X calls an instance of class Y and Y has been modified it needs to recompile X {if, and only if, the signature(s) have changed})

There are standard build systems that handle this. What do you use?

Aegis/cook (aegis.sf.net), and the issue is not the build system per se but how most build systems deal with the split source view aegis provides (which the aegis documentation correctly points out *ALMOST* no build system deals with correctly except for cook)... i.e. every build system assumes that there is a single definition of each compilation unit, where aegis makes it so the current version and all previous versions are available (this makes tracking what changes have been made much easier than, say, cvs/svn/git; besides, none of them are truly atomic in terms of commits {you have to prove your code works before committing it, and this is enforced by the VC system}).

Lint. Findbugs. And more. Why reinvent the wheel?

Because there are other style issues that are not enforced (but are proprietary) that make those unusable

Also known as "a JVM"?

As far as I know the JVM can not be directly booted (as in if I turn on my PC it can not boot into the JVM)... neither for performance reasons does it make sense to run a VM at the bottom layer... another reason is there is a lot of junk in the JRE (like how do you do garbage collection if you do not have some way of the OS allocating mem to a process in the first place)

Aryeh M. Friedman · Jan 3, 2013

Well - since he is writing a lexer for Java then ...

A little more on the project... while the overall project *IS* for fun, a few components may find their way into more serious work-related projects, but only to be used on code written by me or others on my team... specifically we may use the lexing/parsing component to make the following tools (the actual code generation/etc. of the compilation is currently purely fun [see note]):

And your team can't use a real Java compiler because ... ?

Who said we aren't? I only said the lexing/parsing components would be used by the team; the actual code generation is purely for fun.

Aegis/cook (aegis.sf.net), and the issue is not the build system per se but how most build systems deal with the split source view aegis provides (which the aegis documentation correctly points out *ALMOST* no build system deals with correctly except for cook)... i.e. every build system assumes that there is a single definition of each compilation unit, where aegis makes it so the current version and all previous versions are available (this makes tracking what changes have been made much easier than, say, cvs/svn/git; besides, none of them are truly atomic in terms of commits {you have to prove your code works before committing it, and this is enforced by the VC system}).

Also cook is the only build system (besides make) that I have dealt with that does not make extremely hard and often incorrect assumptions about your environment. For example ant/maven/etc. are a nightmare for anything besides Java (we often have 2 or 3 langs in one project), and very often they assume you're using an IDE (and often a specific one), where we do all our work from the command line by company policy (we have never found an IDE that correctly handles stuff like making completely standalone executable jars and such).

Arne Vajhøj · Jan 3, 2013

Also cook is the only build system (besides make) that I have dealt
with that does not make extremely hard and often incorrect assumptions
about your environment. For example ant/maven/etc. are a nightmare for
anything besides Java (we often have 2 or 3 langs in one project)

Ant and maven are designed for Java. They work fine for Java and
Java compatible languages.

They may not work great for completely unrelated languages.

But they can activate other build systems and most other build
systems can activate them.

and
very often they assume you're using an IDE (and often a specific one)

Ant and maven do not make such an assumption.

where we do all our work from the command line by company policy (we
have never found an IDE that correctly handles stuff like making
completely standalone executable jars and such)

All Java IDE's that I know can do that.

So can ant and maven.

Arne

Arne Vajhøj · Jan 3, 2013

As far as I know the JVM can not be directly booted (as in if I turn on
my PC it can not boot into the JVM)...

The JVM was not created for writing OS'es but for writing
applications so that is correct.

neither for performance
reasons does it make sense to run a VM at the bottom layer...

Performance should not be a problem.

another reason is there is a lot of junk in the JRE (like how do you do
garbage collection if you do not have some way of the OS allocating
mem to a process in the first place)

Again. Java was designed to write applications not OS'es.

The "junk" you are talking about is what makes it useful
for the big majority.

Arne

Aryeh M. Friedman · Jan 3, 2013

All Java IDE's that I know can do that.

Let's see we have tried eclipse, netbeans, bluej, dr. java and a few others and every single one failed to produce jars that can be run without after build changes to the manifest and/or needed libs that came with the IDE

So can ant and maven.

Ant and Maven can, agreed.

Arne Vajhøj · Jan 3, 2013

the ground up in a superset of Java (the only addition I see that is
needed is some type of "safe" pointer type)... in this case safe
being defined as you can assign a literal address to it but you're not
allowed to do ptr math on it

See http://en.wikipedia.org/wiki/Singularity_(operating_system)
for an example of OS in managed language.

Arne

Aryeh M. Friedman · Jan 3, 2013

The JVM was not created for writing OS'es but for writing

applications so that is correct.

In my professional life that's how I use Java; the comment only pertained to the motivation for writing a native compiler (which is for fun)

Arne Vajhøj · Jan 3, 2013

Let's see we have tried eclipse, netbeans, bluej, dr. java and a few
others and every single one failed to produce jars that can be run
without after build changes to the manifest and/or needed libs that
came with the IDE

It can be done.

Obviously it can also be made not to work.

Maybe you should master a Java IDE before writing an OS in Java.

Arne
