Don't do this at home

Roedy Green

Sun set a very bad example by naming a utility
native2ascii.

It will convert various encodings to Unicode and back.

It should have been a pair of utilities called something like:

toUnicode and toNative

or NativeToUnicode and UnicodeToNative

ASCII is NOT Unicode!
 
Joona I Palaste

Reminds me of when I used to work on I18N and someone would ask me a
question like "Why is the ASCII code for O-umlaut different on a Mac
and a Windows system?". If I was feeling grumpy I'd say "it isn't"
and see if they could figure out the point I was making.

I've still not yet quite forgiven the incident (on a non-technical
newsgroup) where someone wrote the UTF-8 rendition of the copyright
symbol and added "(that's a copyright symbol for the ASCII-impaired)",
as if ASCII was a generic name for the entire concept of an ordered
set of character glyphs. What's really infuriating is that the guy
flamed me for correcting him.
 
arne thormodsen

Roedy Green said:
Sun set a very bad example by naming a utility
native2ascii.

It will convert various encodings to Unicode and back.

It should have been a pair of utilities called something like:

toUnicode and toNative

or NativeToUnicode and UnicodeToNative

ASCII is NOT Unicode!

Reminds me of when I used to work on I18N and someone would ask me a
question like "Why is the ASCII code for O-umlaut different on a Mac
and a Windows system?". If I was feeling grumpy I'd say "it isn't"
and see if they could figure out the point I was making.

--arne
 
Michael Borgwardt

Roedy said:
Sun set a very bad example by naming a utility
native2ascii.

It will convert various encodings to Unicode and back.

No it won't. It will convert various encodings to ASCII, with non-ASCII
characters represented as ASCII escape sequences giving their Unicode
code points.
It should have been a pair of utilities called something like:

toUnicode and toNative

or NativeToUnicode and UnicodeToNative

That would misrepresent what is actually done even worse.

ASCII is NOT Unicode!

"Unicode" isn't a text encoding at all.
 
Roedy Green

No it won't. It will convert various encodings to ASCII, with non-ASCII
characters represented as ASCII escape sequences giving their Unicode
code points.

Do you mean unicode-8 or some ad hoc representation?
 
Roedy Green

If you meant UTF-8, no. UTF-8 is not confined to ASCII.

It seems to use that when you convert its ascii to "ASCII" encoding
too. I wonder how it would escape accidental \uxxxx if you used it on
a Java program for example.
 
Michael Borgwardt

Roedy said:
It seems to use that when you convert its ascii to "ASCII" encoding
too.

I don't understand this sentence.
I wonder how it would escape accidental \uxxxx if you used it on
a java program for example.

Since they are ASCII, native2ascii leaves them unchanged. The compiler
will turn them into the corresponding Unicode character. It's really
just a special case of character escapes, which all begin with \.
If you want a literal \u0046 in your program, you have to type
\\u0046, just like you have to type \\n to get \n and not a linefeed.
The difference is that the Unicode escapes are processed before
lexical analysis, so a Unicode-escaped linefeed is treated as an
actual line break in the source code, not as a linefeed character.
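
A small sketch of that behavior (the class name here is invented; the
escape handling is as the Java Language Specification describes):

    public class EscapeDemo {
        public static void main(String[] args) {
            // \u0041 is translated before lexing, so this is the same as writing 'A'.
            char c = '\u0041';

            // The doubled backslash stops Unicode-escape processing, so this string
            // holds the six literal characters \ u 0 0 4 6.
            String literal = "\\u0046";

            System.out.println(c);        // prints A
            System.out.println(literal);  // prints \u0046
        }
    }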
 
Stanimir Stamenkov

/Roedy Green/:
It seems to use that when you convert its ascii to "ASCII" encoding
too.

It will convert "ANSI" to ASCII. That's it - it will take a text
file, decode it using the system/native encoding and produce plain
ASCII encoded file where characters outside the ASCII repertoire are
replaced with \uXXXX escapes.
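
A rough Java sketch of that forward direction (just an illustration;
the real tool also takes an -encoding option and a -reverse option):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class NativeToAsciiSketch {
        public static void main(String[] args) throws IOException {
            // Decode the input file using the platform's native encoding.
            String text = new String(Files.readAllBytes(Paths.get(args[0])),
                    System.getProperty("file.encoding"));
            StringBuilder out = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (c <= 0x7F) {
                    // Characters in the ASCII repertoire pass through unchanged.
                    out.append(c);
                } else {
                    // Everything else is replaced by a backslash-u escape.
                    out.append(String.format("\\u%04x", (int) c));
                }
            }
            System.out.print(out);
        }
    }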
 
Roedy Green

I don't understand this sentence.

ASCII is also the name of an encoding. So you can convert from its
intermediate format to the official ASCII encoding, which seems to use
these \uxxxx things too. That was a surprise. I would have expected ?
or SUB for any exotic character.
 
Michael Borgwardt

Roedy said:
ASCII is also the name of an encoding.

It's nothing BUT the name of an encoding.
So you can convert from its
intermediate format

Which intermediate format and what "it" would that be?
to the official ASCII encoding, which seems to use
these \uxxxx things too.

No, those escape sequences have nothing to do with the ASCII
encoding except being composed entirely of ASCII characters.
They are defined in the Java Language Specification and
all specification-compliant compilers will interpret them.

The point of these escape sequences and native2ascii is to
create a "normalized" format for Java source code that everyone
can deal with but still allows the full range of Unicode
characters to be used (well, the full range of UCS-2 anyway).
 
Roedy Green

The point of these escape sequences and native2ascii is to
create a "normalized" format for Java source code that everyone
can deal with but still allows the full range of Unicode
characters to be used (well, the full range of UCS-2 anyway).

Fine, but that is not what you normally expect from a translation to
ASCII. You would expect exotic characters to translate to ? or SUB
the way they do for all the other encodings.

This ASCII is not your father's ASCII.
 
Michael Borgwardt

Roedy Green said:
Fine, but that is not what you normally expect from a translation to
ASCII. You would expect exotic characters to translate to ? or SUB
the way they do for all the other encodings.

This ASCII is not your father's ASCII.

I agree that the name of the tool does not adequately describe its
purpose, and is in fact somewhat misleading.
 
Dale King

Roedy Green said:
Fine, but that is not what you normally expect from a translation to
ASCII. You would expect exotic characters to translate to ? or SUB
the way they do for all the other encodings.

This ASCII is not your father's ASCII.

Have you bothered to read the documentation for the native2ascii tool?

http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html

It makes it very clear what it does. It says it converts it to
Unicode-encoded characters. The only place the word ASCII appears is in the
name of the tool. The output of the tool (or the input if you put it into
reverse) is ASCII only. So there is nothing misleading about saying it
converts to ASCII.

I fail to see what you are complaining about. They cannot put every bit of
information about what it does into the name of the tool. Calling it
native2unicode would have been very much incorrect. It seems you would only
be happy if they called it native2ascii_with_unicode_escape_sequences, which
I'm afraid is a bit too much to type.

The moral here is that you cannot assume everything about how a tool works
just by the name. Sometimes you have to read the documentation.

Still refuse to put the space after the dashes?
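
For reference, the invocations that documentation describes look roughly
like this (the file and encoding names are just examples):

    native2ascii -encoding Shift_JIS messages_ja.properties messages_ja_ascii.properties
    native2ascii -reverse -encoding Shift_JIS messages_ja_ascii.properties messages_ja.properties

The first command converts a natively encoded file into the escaped ASCII
form; the second converts it back.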
 
Roedy Green

It makes it very clear what it does. It says it converts it to
Unicode-encoded characters. The only place the word ASCII appears is in the
name of the tool.

Which is my only complaint, other than the goofy -reverse option.
 
Roedy Green

And I still fail to see what there is to complain about. The
tool takes a file in a native encoding and converts it to ASCII.
I don't see what you find wrong with the name since it accurately
reflects what it does.

The name nativeToAscii fails in two respects.

1. The utility does not do a standard "conversion to ASCII". It
converts to a Sun-invented encoding scheme described with ASCII
characters. It is not ASCII any more than Base64 is. ASCII only
represents 128 chars. Sun's encoding represents 64K.

2. nativeToAscii sometimes converts "ASCII" to native, the reverse of
what its name implies.
 
P.Hill

Roedy said:
1. The utility does not do a standard "conversion to ASCII".

No, agreed, but it does produce a file which contains only ASCII characters,
regardless of what certain sequences of ASCII characters are intended to
represent in the particular "language"/"encoding".
These ASCII files can be safely processed by utilities which only work with
ASCII.
described with ASCII
characters. It is not ASCII any more than Base64 is.

This is an alternative interpretation of what an "ASCII file" should
contain.

RFC 1642 which uses Base64 encoding
http://www.faqs.org/rfcs/rfc1642.html
"Internet mail (STD 11, RFC 822) currently supports only 7-
bit US ASCII as a character set."
[...]
"This document describes a new transformation format of Unicode that
contains only 7-bit ASCII characters"
[...]
"UTF-7 encodes Unicode characters as US-ASCII"

That is the same kind of usage the name nativeToAscii suggests.
ASCII only
represents 128 chars. Sun's encoding represents 64K.

The characters in the file are pure ASCII, thus the name.

Sorry that you expect the name to be nativeToAsciiWithOtherCharactersEscaped

-Paul
 
Roedy Green

Sorry that you expect the name to be nativeToAsciiWithOtherCharactersEscaped

I would call them toNative and fromNative with an implied interchange
format. Don't confuse the issue by calling it ASCII. It is not. This
is NOT the way you encode those characters in ASCII. All the weird
ones should be SUB or ?

ASCII files are human-readable things with chars 0..127. I don't
count MIME, Base64 or Sun's Unicode encoding as ASCII even though they
use the ASCII set. It is no more ASCII than Indonesian is English
because they use the same alphabet.
 
Dale King

Hello, Roedy Green!
You said:
The name nativeToAscii fails in two respects.

1. The utility does not do a standard "conversion to ASCII". It
converts to a Sun-invented encoding scheme described with ASCII
characters. It is not ASCII any more than Base64 is.

No, that is not a valid comparison. In Base64 the value
0x41 is merely encoding some combination of bits and does not
signify the letter A as it does in ASCII. The same is not true of
the output of native2ascii. Every numeric value actually means
the same thing as it does in ASCII.
ASCII only
represents 128 chars. Sun's encoding represents 64K.

Not a valid criticism. Each of its 128 values maps to one
abstract symbol, but people have been finding ways to use
combinations of those symbols to represent other things for years
now. For example they might use e^ to signify the letter ê.
Saying that is not ASCII is like saying it isn't ASCII because
you can combine the letters to form words. ASCII only specifies
the meaning of individual symbols and does not limit what meaning
you apply to the combinations of those symbols.
2. nativeToAscii sometimes converts "ASCII" to native, the reverse of
what its name implies.

In which case the command would be native2ascii -reverse, which
seems fairly self-explanatory to me. They could have split that
into 2 programs, but that would be inferior to a single program
in my mind. And remember that the reverse direction is usually
used much more rarely. I'm not even sure that the reverse
option was in the original version.
ASCII files are human-readable things with chars 0..127.

So is the output of native2ascii.
I don't
count MIME, Base64 or Sun's Unicode encoding as ASCII even though they
use the ASCII set.

You have to separate the numeric values from the meaning assigned
to specific values. I can't speak for MIME but you are correct
that Base64 is not ASCII. Even though it uses the same range of
values it does not assign the same meaning as does ASCII. The
output of native2ascii uses the same numeric range and assigns
the same meaning to those values and is therefore ASCII.
It is no more ASCII than Indonesian is English
because they use the same alphabet.

I don't believe Indonesian actually uses the same alphabet as
English, but that is beside the point.

Once again the issue is not the bit patterns, but the meanings
assigned to those bit patterns. Indonesian is not English because
they do not ascribe the same meanings to the letters. But that
is not the case with native2ascii since it assigns the same
meaning as does ASCII.

A better analogy would be an English text that contained a
foreign word or phrase. Does the text cease to be English because
of this?
This
is NOT the way you encode those characters in ASCII.

No you don't encode them at all in ASCII. You have to encode them
in some other convention on top of ASCII, by using one or more
ASCII characters. However that is still ASCII because the meaning
of each code unit is the same.
All the weird
ones should be SUB or ?

I don't see why you think that. Clearly they have to be dropped
or replaced by something else. I don't see what makes one
replacement more natural than another. If you replace them with ?
that is not truly correct since the original file did not have a
? there. Let's say instead that the replacement was the
string "{non-ASCII character}"; would that make it non-ASCII? The
fact that they chose a replacement that differs by character
and is reversible seems to have no bearing on whether it is
ASCII.
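
A small sketch of that difference (the class name and strings are
invented for the example):

    import java.nio.charset.StandardCharsets;

    public class ReplacementDemo {
        public static void main(String[] args) {
            String original = "na\u00efve";   // "naïve"

            // Converting straight to US-ASCII substitutes '?' for the unmappable
            // character, so the original text cannot be recovered.
            byte[] lossy = original.getBytes(StandardCharsets.US_ASCII);
            System.out.println(new String(lossy, StandardCharsets.US_ASCII));  // prints na?ve

            // The escaped form that native2ascii writes stays within ASCII
            // yet remains reversible.
            String escaped = "na\\u00efve";
            System.out.println(escaped);  // prints the escape sequence unchanged
        }
    }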
 
