Don't do this at home

Discussion in 'Java' started by Roedy Green, Apr 29, 2004.

  1. Roedy Green

    Roedy Green Guest

    Sun set a very bad example by naming a utility native2ascii.

    It will convert various encodings to Unicode and back.

    It should have been a pair of utilities called something like:

    toUnicode and toNative

    or NativeToUnicode and UnicodeToNative

    ASCII is NOT Unicode!
    Roedy Green, Apr 29, 2004

  2. I've still not yet quite forgiven the incident (on a non-technical
    newsgroup) where someone wrote the UTF-8 rendition of the copyright
    symbol and added "(that's a copyright symbol for the ASCII-impaired)",
    as if ASCII was a generic name for the entire concept of an ordered
    set of character glyphs. What's really infuriating is that the guy
    flamed me for correcting him.
    Joona I Palaste, Apr 29, 2004

  3. Reminds me of when I used to work on I18N and someone would ask me a
    question like "Why is the ASCII code for O-umlaut different on a Mac
    and a Windows system?". If I was feeling grumpy I'd say "it isn't"
    and see if they could figure out the point I was making.

    arne thormodsen, Apr 29, 2004
  4. No it won't. It will convert various encodings to ASCII with non-ascii
    characters represented as ASCII escape sequences representing their Unicode
    That would misrepresent what is actually done even worse.

    "Unicode" isn't a text encoding at all.
    Michael Borgwardt, Apr 30, 2004
  5. Roedy Green

    Roedy Green Guest

    Do you mean unicode-8 or some ad hoc representation?
    Roedy Green, Apr 30, 2004
  6. Michael Borgwardt, May 3, 2004
  7. Roedy Green

    Roedy Green Guest

    It seems to use that when you convert its ascii to "ASCII" encoding
    too. I wonder how it would escape accidental \uxxxx if you used it on
    a Java program, for example.
    Roedy Green, May 4, 2004
  8. I don't understand this sentence.
    Since they are ASCII, native2ascii leaves them unchanged. The compiler
    will turn them into the corresponding unicode character. It's really
    just a special case of character escapes, which all begin with \.
    If you want a literal \u0046 in your program, you have to type
    \\u0046, just like you have to type \\n to get \n and not a linefeed.
    The difference is that the Unicode escapes are processed before
    lexical analysis, so that a Unicode-escaped linefeed is equivalent
    to an actual line break in the source code, not to a linefeed character.
    Michael Borgwardt, May 4, 2004
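
    The escaping rule described in this post can be checked with a small
    program. This is only an illustrative sketch; the class and method names
    are made up for the example and do not come from the thread.

```java
public class UnicodeEscapeDemo {
    // The compiler translates this Unicode escape before it tokenizes the
    // source, so the literal holds the single letter F.
    static String translated() {
        return "\u0046";
    }

    // Doubling the backslash stops the translation: the literal holds six
    // characters (a backslash, the letter u, and the digits 0 0 4 6).
    static String literal() {
        return "\\u0046";
    }

    public static void main(String[] args) {
        System.out.println(translated());           // prints F
        System.out.println(translated().length());  // prints 1
        System.out.println(literal().length());     // prints 6
    }
}
```

    Because the translation happens so early, writing the escape for a
    linefeed (\u000a) inside a string literal does not compile: the compiler
    sees a real line break in the middle of the literal.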
  9. /Roedy Green/:
    It will convert "ANSI" to ASCII. That's it - it will take a text
    file, decode it using the system/native encoding and produce plain
    ASCII encoded file where characters outside the ASCII repertoire are
    replaced with \uXXXX escapes.
    Stanimir Stamenkov, May 4, 2004
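
    The conversion described in this post (decode with the native encoding,
    then escape everything outside ASCII) could be sketched roughly as
    follows. This is not the actual native2ascii source; the class name, the
    method name, and the choice of lowercase hex digits are assumptions made
    for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Native2AsciiSketch {
    // Hypothetical helper: decode bytes using the given "native" encoding,
    // then keep ASCII characters as they are and replace every other char
    // with a backslash, the letter u, and four hex digits.
    static String nativeToAscii(byte[] nativeBytes, Charset nativeEncoding) {
        String decoded = new String(nativeBytes, nativeEncoding);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the e, as ISO-8859-1 bytes.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };
        System.out.println(nativeToAscii(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

    For the sample input the program prints caf followed by the six-character
    escape for U+00E9, so every character of the output is in the ASCII range.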
  10. Roedy Green

    Roedy Green Guest

    ascii is also the name of an encoding. So you can convert from its
    intermediate format to the official ASCII encoding, which seems to use
    these \uxxxx things too. That was a surprise. I would have expected ?
    or SUB for any exotic character.
    Roedy Green, May 4, 2004
  11. It's nothing BUT the name of an encoding.
    Which intermediate format and what "it" would that be?
    No, those escape sequences have nothing to do with the ASCII
    encoding except being composed entirely of ASCII characters.
    They are defined in the Java Language Specification and
    all specification-compliant compilers will interpret them.

    The point of these escape sequences and native2ascii is to
    create a "normalized" format for Java source code that everyone
    can deal with but still allows the full range of Unicode
    characters to be used (well, the full range of UCS-2 anyway).
    Michael Borgwardt, May 5, 2004
  12. Roedy Green

    Roedy Green Guest

    Fine but that is not what you normally expect from a translation to
    ASCII. You would expect exotic characters to translate to ? or SUB
    the way they do for all the other encodings.

    This ASCII is not your father's ASCII.
    Roedy Green, May 6, 2004
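
    For comparison, the lossy behaviour expected here is what the standard
    Java API does when a string is encoded with the US-ASCII charset:
    unmappable characters are replaced with ?. A minimal sketch, with a
    made-up class and helper name:

```java
import java.nio.charset.StandardCharsets;

public class LossyAsciiDemo {
    // Hypothetical helper: encode to US-ASCII and decode again. Characters
    // with no ASCII equivalent come back as the replacement character '?'.
    static String toAsciiLossy(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(toAsciiLossy("caf\u00e9"));  // prints caf?
    }
}
```

    Unlike the escape sequences written by native2ascii, this replacement
    cannot be reversed, because every unmappable character ends up as the
    same byte.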
  13. I agree that the name of the tool does not adequately describe its
    purpose, and is in fact somewhat misleading.
    Michael Borgwardt, May 6, 2004
  14. Dale King

    Dale King Guest

    Have you bothered to read the documentation for the native2ascii tool?

    It makes it very clear what it does. It says it converts it to
    Unicode-encoded characters. The only place the word ASCII appears is in the
    name of the tool. The output of the tool (or the input if you put it into
    reverse) is ASCII only. So there is nothing misleading about saying it
    converts to ASCII.

    I fail to see what you are complaining about. They cannot put every bit of
    information about what it does into the name of the tool. Calling it
    native2unicode would have been very much incorrect. It seems you would only
    be happy if they called it native2ascii_with_unicode_escape_sequences, which
    I'm afraid is a bit too much to type.

    The moral here is that you cannot assume everything about how a tool works
    just by the name. Sometimes you have to read the documentation.

    Still refuse to put the space after the dashes?
    Dale King, May 7, 2004
  15. Roedy Green

    Roedy Green Guest

    Which is my only complaint, other than the goofy -reverse option.
    Roedy Green, May 7, 2004
  16. Roedy Green

    Roedy Green Guest

    the name nativeToAscii fails in two respects.

    1. The utility does not do a standard "conversion to ASCII". It
    converts to a Sun-invented encoding scheme described with ASCII
    characters. It is not ASCII any more than Base64 is. ASCII only
    represents 128 chars. Sun's encoding represents 64K.

    2. nativeToAscii sometimes converts "ascii" to native, the reverse of
    what its name implies.
    Roedy Green, May 11, 2004
  17. P.Hill

    P.Hill Guest

    No, agreed, but it does produce a file which contains only ASCII characters,
    regardless of what certain sequences of ASCII characters are intended to
    represent in the particular "language"/"encoding".
    These ASCII files can be safely processed by utilities which only work with
    This is an alternative interpretation of what an "ASCII file" should

    RFC 1642 which uses Base64 encoding
    "Internet mail (STD 11, RFC 822) currently supports only 7-
    bit US ASCII as a character set."
    "This document describes a new transformation format of Unicode that
    contains only 7-bit ASCII characters"
    "UTF-7 encodes Unicode characters as US-ASCII"

    That is a similar usage to the one the name nativeToAscii suggests.
    The characters in the file are pure ASCII, thus the name.

    Sorry that you expect the name to be nativeToAsciiWithOtherCharactersEscaped

    P.Hill, May 11, 2004
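
    The claim that the escaped output contains only ASCII characters can be
    tested mechanically. The sketch below uses the standard CharsetEncoder
    API; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class PureAsciiCheck {
    // Returns true when every character of the text can be encoded as
    // US-ASCII, which is what an ASCII-only utility would require.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String nativeText = "caf\u00e9";    // ends in a real accented e
        String escapedText = "caf\\u00e9";  // ends in the six-character escape
        System.out.println(isPureAscii(nativeText));   // prints false
        System.out.println(isPureAscii(escapedText));  // prints true
    }
}
```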
  18. Roedy Green

    Roedy Green Guest

    I would call them toNative and fromNative with an implied interchange
    format. Don't confuse the issue by calling it ASCII. It is not. This
    is NOT the way you encode those characters in ASCII. All the weird
    ones should be SUB or ?

    ASCII files are human readable things with chars 0..127. I don't
    count mime, base64 or Sun's Unicode encoding as ASCII even though they
    use the ASCII set. It is no more ASCII than Indonesian is English
    because they use the same alphabet.
    Roedy Green, May 12, 2004
  19. Dale King

    Dale King Guest

    Hello, Roedy Green !
    No, that is not a valid comparison. In base64 the value
    0x41 is merely encoding some combination of bits and does not
    signify the letter A as it does in ASCII. The same is not true of
    the output of native2ascii. Every numeric value actually means
    the same thing as it does in ASCII.
    Not a valid criticism. Each of its 128 values maps to one
    abstract symbol, but people have been finding ways to use
    combinations of those symbols to represent other things for years
    now. For example they might use e^ to signify the letter ê.
    Saying that is not ASCII is like saying it isn't ASCII because
    you can combine the letters to form words. ASCII only specifies
    the meaning of individual symbols and does not limit what meaning
    you apply to the combinations of those symbols.
    In which case the command would be native2ascii -reverse which
    seems fairly self explanatory to me. They could have split that
    into 2 programs, but that would be inferior to a single program
    in my mind. And remember that it is usually much rarer that the
    reverse direction is used. I'm not even sure that the reverse
    option was in the original version.
    So is the output of native2ascii.
    You have to separate the numeric values from the meaning assigned
    to specific values. I can't speak for MIME but you are correct
    that base64 is not ASCII. Even though it uses the same range of
    values it does not assign the same meaning as does ASCII. The
    output of native2ascii uses the same numeric range and assigns
    the same meaning to those values and is therefore ASCII.
    I don't believe Indonesian actually uses the same alphabet as
    English, but that is beside the point.

    Once again the issue is not the bit patterns, but the meanings
    assigned to those bit patterns. Indonesian is not English because
    they do not ascribe the same meanings to the letters. But that
    is not the case with native2ascii since it assigns the same
    meaning as does ASCII.

    A better analogy would be an English text that contained a
    foreign word or phrase. Does the text cease to be English because
    of this?
    No you don't encode them at all in ASCII. You have to encode them
    in some other convention on top of ASCII, by using one or more
    ASCII characters. However that is still ASCII because the meaning
    of each code unit is the same.
    I don't see why you think that. Clearly they have to be dropped
    or replaced by something else. I don't see what makes one
    replacement more natural than another. If you replace them with ?
    that is not truly correct since the original file did not have a
    ? there. Let's say instead that the replacement was instead the
    string "{non-ASCII character}", would that make it non ASCII? The
    fact that they choose a replacement that differs by character
    and is reversible seems to have no bearing on whether it is
    Dale King, Apr 15, 2006
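
    The reversibility mentioned in this post can be illustrated with a
    decoder for the escape sequences, roughly what the -reverse option has to
    do. This is a simplified sketch and not the tool's implementation; the
    names are invented, and it does not treat a doubled backslash in front of
    the u specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReverseSketch {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replace each escape with the character it names
    // and copy everything else through unchanged.
    static String unescape(String ascii) {
        Matcher m = ESCAPE.matcher(ascii);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(ascii, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(ascii, last, ascii.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "caf\\u00e9";  // pure ASCII, nine characters
        String restored = unescape(escaped);
        System.out.println(restored.length());              // prints 4
        System.out.println(restored.charAt(3) == '\u00e9'); // prints true
    }
}
```

    A replacement such as ? could not be decoded this way, since it keeps no
    record of which character it stood for.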