Java Newbie Question: Character Sets, Unicode, et al

Discussion in 'Java' started by BLG, Oct 17, 2003.

  1. BLG

    BLG Guest

    Greetings!

    I am studying several books on the Java programming language and have
    come across references to the fact that the JDK uses Unicode as the
    default character set. I am not convinced that I fully comprehend
    what the author is telling me.

    I understand that Unicode is a 16-bit character set and that true
    ASCII is a 7-bit representation. When I look at my source files in a
    hex editor, they appear to be extended ASCII 8-bit format (which I
    assume is the Windows default for a text file). OK - I assume then
    that the JRE uses Unicode character sets, but javac uses some 8-bit
    character set. Is this correct?

    But beyond that, should I even care what the character set is?
    Assuming, of course, internationalization is not a priority for me.

    Also, how do I determine what character set Windows is using? How do
    I change character sets in Windows?

    And lastly, what is the relationship between a character set and a
    font?

    I hope these questions aren't too off the wall. I am trying to
    clarify in my mind this character set concept. In the past, my only
    concern has been ASCII vs EBCDIC.

    Regards!
     
    BLG, Oct 17, 2003
    #1

  2. Chris Smith

    Chris Smith Guest

    BLG wrote:
    > I understand that Unicode is a 16-bit character set and that true
    > ASCII is a 7-bit representation.


    Actually, that's a more accurate statement than most people would have
    made. If you want to be really picky, though, I'd make one correction:
    Unicode and ASCII are both character sets. ASCII is *also* a character
    encoding (which may be what you meant by representation), which Unicode
    is not (instead, there are several common encodings for the Unicode
    character set, including UTF-8, UTF-16LE and UTF-16BE).
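
    A minimal sketch of the difference in Java, assuming a JRE recent
    enough to know these encoding names (1.4 certainly does):

    import java.io.UnsupportedEncodingException;

    public class EncodingDemo {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // One sequence of Unicode characters...
            String s = "Hi\u20ac"; // 'H', 'i', and the euro sign U+20AC

            // ...and three different byte serializations of it:
            System.out.println(s.getBytes("UTF-8").length);    // 5 (the euro takes 3 bytes)
            System.out.println(s.getBytes("UTF-16BE").length); // 6 (2 bytes per char)
            System.out.println(s.getBytes("UTF-16LE").length); // 6 (same, other byte order)
        }
    }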

    > When I look at my source files in a
    > hex editor, they appear to be extended ASCII 8-bit format (which I
    > assume is the Windows default for a text file).


    No. There is no such thing as "extended ASCII 8-bit format" as some
    specific entity. There are, actually, quite a large number of different
    8-bit character encodings, including Cp1252, ISO8859-1, ISO8859-2,
    ISO8859-3, and so on and so forth, and practically all of them are
    different extensions to ASCII. ASCII itself, as a character encoding,
    is in fact an 8-bit encoding as well, with the high-order bit always
    being set to zero.

    > OK - I assume then
    > that the JRE uses Unicode character sets, but javac uses some 8-bit
    > character set. Is this correct?


    Not necessarily. The javac compiler uses the platform default character
    encoding. What that is depends on what platform you're developing on.

    More to the point, though, you're looking to the wrong place for that
    answer. The javac utility is a *consumer* of your source files; it
    doesn't create them. The format of your source files comes from
    whatever tool you've used to write them, and I don't know what tool that
    is. You're fortunate that it happens to be compatible with what javac
    expects to read (which will generally happen if you stick to ASCII
    characters in your source code, but can be a problem if not).

    > But beyond that, should I even care what the character set is?


    Sure. If you expect to write robust character-based I/O code, that is
    (in any language, not just Java).

    > Assuming, of course, internationalization is not a priority for me.


    That's not necessarily relevant. Encodings vary between platforms,
    languages and language configurations, and applications, among other
    things. They are often specified in protocol and file format
    descriptions. You don't have to be doing i18n code to care about the
    definition of a character encoding.
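
    For instance, robust I/O code names its encoding explicitly instead
    of trusting the platform default. A minimal sketch (the file name is
    just an example):

    import java.io.*;

    public class ExplicitEncoding {
        public static void main(String[] args) throws IOException {
            // Write characters out with a known, named encoding...
            Writer out = new OutputStreamWriter(
                    new FileOutputStream("data.txt"), "UTF-8");
            out.write("caf\u00e9"); // "cafe" with an acute accent
            out.close();

            // ...and read them back with the same one, so the result is
            // the same no matter what the platform default happens to be.
            Reader in = new InputStreamReader(
                    new FileInputStream("data.txt"), "UTF-8");
            int c;
            while ((c = in.read()) != -1) {
                System.out.print((char) c);
            }
            in.close();
        }
    }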

    > Also, how do I determine what character set Windows is using? How do
    > I change character sets in Windows?


    Actually, I have no idea. I've never needed to do it.

    > And lastly, what is the relationship between a character set and a
    > font?


    A font provides glyphs (visual appearances) for some set of characters.
    The relationship, I suppose, is that if you want to reliably display
    content in a certain character set, your font had better have the
    appropriate glyphs for at least the common characters in that character
    set. In Java, fonts map their glyphs directly to Unicode characters, so
    there's no direct relationship between the smaller character sets like
    ASCII and fonts.
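
    In Java you can even ask a font whether it has a glyph for a given
    character. A small sketch (the font name and the test characters are
    arbitrary examples):

    import java.awt.Font;

    public class GlyphCheck {
        public static void main(String[] args) {
            Font font = new Font("Serif", Font.PLAIN, 12);
            // canDisplay reports whether this font has a glyph for the character
            System.out.println(font.canDisplay('A'));      // almost certainly true
            System.out.println(font.canDisplay('\u0950')); // DEVANAGARI OM; maybe not
        }
    }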

    --
    www.designacourse.com
    The Easiest Way to Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
     
    Chris Smith, Oct 17, 2003
    #2

  3. BLG wrote:
    > I understand that Unicode is a 16-bit character set and that true
    > ASCII is a 7-bit representation.


    Actually, Unicode is not really a character set in the way ASCII is,
    and it is not restricted to 16 bits.

    Unicode is a standard that assigns glyphs (characters) to numeric codes.
    How these codes are concretely represented as bytes is what an encoding
    or charset specifies, which is what ASCII is. There are encodings where
    the number of bits used varies depending on each character, like UTF-8.
    There are even stateful encodings.

    > When I look at my source files in a
    > hex editor, they appear to be extended ASCII 8-bit format (which I
    > assume is the Windows default for a text file).


    Namely Windows Codepage 1252, which is nearly the same as
    ISO-8859-1, aka Latin 1, the most common encoding for western
    European languages.

    > OK - I assume then
    > that the JRE uses Unicode character sets, but javac uses some 8-bit
    > character set. Is this correct?


    Nearly. How the JRE internally represents Strings is not really
    specified, but the usual way is to use 16 bits per character
    in a straightforward way. javac, on the other hand, uses the
    platform default encoding (unless otherwise specified on the
    command line), with the additional capability to process Unicode
    escape sequences (\uxxxx) when reading in source files. The
    class files contain Strings encoded as UTF-8.
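
    To illustrate the escape mechanism: the following source file is pure
    7-bit ASCII on disk, yet the compiled class contains a non-ASCII
    character, because javac translates the escape before doing anything
    else. (If your sources really do contain non-ASCII bytes, you can name
    their encoding explicitly, e.g. javac -encoding ISO8859-1 Foo.java.)

    public class EscapeDemo {
        public static void main(String[] args) {
            String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
            System.out.println((int) eAcute.charAt(0)); // prints 233
        }
    }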

    > But beyond that, should I even care what the character set is?
    > Assuming, of course, internationalization is not a priority for me.


    Yes, it is still important when writing text out to or reading it
    from a file or network socket. It's quite likely that at some point
    you'll use *some* non-ASCII character, and in fact it is not even
    guaranteed that all encodings represent even pure ASCII text
    identically.
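
    A concrete example of that last point: even the plain letter 'A' has
    no universal byte value. This sketch assumes the EBCDIC codepage Cp037
    is available in your JRE (Sun's ships it as an extended charset):

    public class NotQuiteUniversal {
        public static void main(String[] args) throws Exception {
            // ASCII-based encodings put 'A' at 0x41...
            System.out.println("A".getBytes("US-ASCII")[0] & 0xff); // 65 (0x41)
            // ...but the EBCDIC family puts it at 0xC1.
            System.out.println("A".getBytes("Cp037")[0] & 0xff);    // 193 (0xC1)
        }
    }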

    > Also, how do I determine what character set Windows is using?


    More recent Windows versions (since Windows 2000, I think) use
    Unicode internally as far as possible, but older applications that
    can't handle Unicode use a "traditional encoding" that differs
    between languages. This is the platform default encoding.
    In Java, it's a System property, file.encoding or some such.
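
    A one-liner shows it (file.encoding is widely present but not an
    officially specified property, so treat the result as informational):

    public class DefaultEncoding {
        public static void main(String[] args) {
            System.out.println(System.getProperty("file.encoding"));
        }
    }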

    > How do I change character sets in Windows?


    There's an option in the country & language settings somewhere that
    changes the default encoding used for older apps.

    > And lastly, what is the relationship between a character set and a
    > font?


    An encoding defines relationships between numeric codes or byte
    representations thereof and glyphs. A font defines how the glyphs
    are drawn on the screen. Different abstract glyphs can be (and
    sometimes are) assigned the same shape in a font, and nearly all
    fonts contain only shapes for a subset of the glyphs defined in
    Unicode.
     
    Michael Borgwardt, Oct 18, 2003
    #3
  4. Roedy Green

    Roedy Green Guest

    On Sat, 18 Oct 2003 01:08:14 +0200, Michael Borgwardt
    <> wrote or quoted :

    >Unicode is a standard that assigns glyphs (characters) to numeric codes.
    >How these codes are concretely represented as bytes is what an encoding
    >or charset specifies, which is what ASCII is. There are encodings where
    >the number of bits used varies depending on each character, like UTF-8.
    >There are even stateful encodings.


    There is only one way you can encode ASCII as bytes, but there are
    several variants for encoding Unicode with combinations of big/little
    endian, marked/unmarked, 8-bit/16-bit encoding.

    see http://mindprod.com/jgloss/encoding.html

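
    Those variants are easy to observe from Java. A small sketch (writing
    a big-endian byte order mark for plain "UTF-16" is what Sun's encoder
    does; other implementations may choose differently):

    public class ByteOrderDemo {
        public static void main(String[] args) throws Exception {
            dump("A".getBytes("UTF-16"));   // fe ff 00 41 (BOM, then big-endian)
            dump("A".getBytes("UTF-16BE")); // 00 41 (unmarked big-endian)
            dump("A".getBytes("UTF-16LE")); // 41 00 (unmarked little-endian)
        }

        private static void dump(byte[] bytes) {
            for (int i = 0; i < bytes.length; i++) {
                // Zero-padded hex: 0x100 | b forces a leading digit to strip
                System.out.print(Integer.toHexString(0x100 | (bytes[i] & 0xff))
                        .substring(1) + " ");
            }
            System.out.println();
        }
    }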

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 18, 2003
    #4
  5. Guest

    Guest

    BLG wrote:

    >I understand that Unicode is a 16-bit character set and that true
    >ASCII is a 7-bit representation.


    That is incorrect.

    This is a recent article on Unicode that serves as a good introduction:

    http://www.joelonsoftware.com/articles/Unicode.html
     
    , Oct 18, 2003
    #5
  6. Roedy Green

    Roedy Green Guest

    On Sat, 18 Oct 2003 12:23:56 -0500, wrote or
    quoted :

    >>I understand that Unicode is a 16-bit character set and that true
    >>ASCII is a 7-bit representation.

    >
    >That is incorrect.


    Unicode is a 16 bit character set allowing 64K different glyphs/codes.


    ASCII is a 7-bit character set allowing 128 different glyphs/codes.

    ASCII is written in octets, usually with the high bit off.

    Unicode is written many different ways. See
    http://mindprod.com/jgloss/encoding.html


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 18, 2003
    #6
  7. Guest

    Guest

    Roedy Green <> wrote:

    >Unicode is a 16 bit character set allowing 64K different glyphs/codes.


    Nope.

    But don't take my word as gospel. You might wish to start browsing here, to
    get the information straight from the source:

    http://www.unicode.org/faq/
     
    , Oct 18, 2003
    #7
  8. Roedy Green wrote:

    > On Sat, 18 Oct 2003 12:23:56 -0500, wrote or
    > quoted :
    >
    >
    >>>I understand that Unicode is a 16-bit character set and that true
    >>>ASCII is a 7-bit representation.

    >>
    >>That is incorrect.

    >
    >
    > Unicode is a 16 bit character set allowing 64K different glyphs/codes.


    Not any more; it hasn't been 16 bit for some time. The current
    incarnation of Unicode requires at least 20 bits. See
    http://www.unicode.org/versions/Unicode4.0.0/
    Note that there are 96248 'graphic' characters defined.

    Mark Thornton
     
    Mark Thornton, Oct 18, 2003
    #8
  9. Roedy Green

    Roedy Green Guest

    On Sat, 18 Oct 2003 15:45:38 -0500, wrote or
    quoted :

    >>Unicode is a 16 bit character set allowing 64K different glyphs/codes.

    >
    >Nope.


    In what sense nope? I presume you are being picky about the precise
    meanings of "encoding", "character set" and "glyph". Am I wrong in
    any sense that would make a difference to anyone but a linguist?


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 19, 2003
    #9
  10. Roedy Green

    Roedy Green Guest

    On Sat, 18 Oct 2003 22:15:29 +0100, Mark Thornton
    <> wrote or quoted :

    >> Unicode is a 16 bit character set allowing 64K different glyphs/codes.

    >
    >Not any more; it hasn't been 16 bit for some time.


    Unicode without a trailing number means Unicode-16, does it not? Or
    has that changed?

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 19, 2003
    #10
  11. Roedy Green wrote:

    > On Sat, 18 Oct 2003 22:15:29 +0100, Mark Thornton
    > <> wrote or quoted :
    >
    >
    >>>Unicode is a 16 bit character set allowing 64K different glyphs/codes.

    >>
    >>Not any more; it hasn't been 16 bit for some time.

    >
    >
    > Unicode without a trailing number means Unicode-16 does it not? or has
    > that changed?


    It has changed. Unicode is the full collection of code points. There are
    various encodings such as UTF-8, UTF-16, and UTF-32.
    Unicode characters may be encoded at any code point from U+0000 to U+10FFFF.

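    In Java terms: a char is a UTF-16 code unit, not a code point, so a
    character beyond U+FFFF occupies two chars. A sketch of the surrogate
    arithmetic, hand-coded because the 1.4 API has no helper for it:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
            int cp = 0x1D11E;
            char high = (char) (0xD800 + ((cp - 0x10000) >> 10));   // 0xD834
            char low  = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF)); // 0xDD1E
            String s = new String(new char[] { high, low });
            System.out.println(s.length()); // 2 chars for one character
        }
    }
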
    Q: What is the difference between UCS-2 and UTF-16?

    A: UCS-2 is what a Unicode implementation was up to Unicode 1.1,
    *before* surrogate code points and UTF-16 were added as concepts to
    Version 2.0 of the standard. This term should now be avoided.

    Q: What is UTF-16?

    A: Unicode was originally designed as a pure 16-bit encoding, aimed at
    representing all modern scripts. (Ancient scripts were to be represented
    with private-use characters.) Over time, and especially after the
    addition of over 14,500 composite characters for compatibility with
    legacy sets, it became clear that 16 bits were not sufficient for the
    user community. Out of this arose UTF-16.


    Q: Will UTF-16 ever be extended to more than a million characters?

    A: As stated, the goal of Unicode is not to encode glyphs, but
    characters. Over a million possible codes is far more than enough for
    this goal. Unicode is *not* designed to encode arbitrary data. If you
    wanted, for example, to give each "instance of a character on paper
    throughout history" its own code, you might need trillions or
    quadrillions of such codes; noble as this effort might be, you would not
    use Unicode for such an encoding. No proposed extensions of UTF-16 to
    more than 2 surrogates have a chance of being accepted into the Unicode
    Standard or ISO/IEC 10646. Furthermore, both Unicode and ISO 10646 have
    policies in place that formally limit even the UTF-32 encoding form to
    the integer range that can be expressed with UTF-16 (or 21 significant
    bits). [MD]


    Mark Thornton
     
    Mark Thornton, Oct 19, 2003
    #11
  12. Michael Borgwardt wrote:
    > Unicode is a standard that assigns glyphs (characters) to numeric codes.


    Hmmm. I thought that there was a clear distinction between characters
    and glyphs. Character sets map characters to numeric codes (and vice
    versa), whereas fonts map glyphs to characters. There may be many
    different glyphs that represent any particular character (hence the
    differentiation of fonts), and in some cases a character may require
    more than one glyph. A character is a logical entity, without an
    inherent physical representation. Or so I thought. Am I suffering from
    a longstanding confusion here?


    John Bollinger
     
    John C. Bollinger, Oct 20, 2003
    #12
  13. Roedy Green

    Roedy Green Guest

    On Mon, 20 Oct 2003 09:06:41 -0500, "John C. Bollinger"
    <> wrote or quoted :

    >Hmmm. I thought that there was a clear distinction between characters
    >and glyphs. Character sets map characters to numeric codes (and vice
    >versa), whereas fonts map glyphs to characters. There may be many
    >different glyphs that represent any particular character (hence the
    >differentiation of fonts), and in some cases a character may require
    >more than one glyph. A character is a logical entity, without an
    >inherent physical representation. Or so I thought. Am I suffering from
    >a longstanding confusion here?


    We need some definitions that make clear the distinction between:
    a character set,
    a character,
    a glyph,
    a font,
    an encoding.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 21, 2003
    #13
  14. Re: Java Newbie Question: Character Sets, Unicode, et al [Long]

    Roedy Green wrote:
    > We need some definitions that make clear the distinction between:
    > a character set,
    > a character,
    > a glyph,
    > a font,
    > an encoding.


    I'll take a stab at it.

    I decided to see what Unicode had to say on the matter. That seemed
    relevant, but may have been a mistake. In any event, here are some
    possibly relevant definitions from the Unicode 4.0 glossary
    (http://www.unicode.org/glossary/):

    ==== From the Unicode 4.0 Glossary ====

    Abstract Character. A unit of information used for the organization,
    control, or representation of textual data. (See Definition D3 in
    Section 3.3, Characters and Coded Representations.)

    Character. (1) The smallest component of written language that has
    semantic value; refers to the abstract meaning and/or shape, rather than
    a specific shape (see also glyph), though in code tables some form of
    visual representation is essential for the reader's understanding. (2)
    Synonym for abstract character. (See Definition D3 in Section 3.3,
    Characters and Coded Representations.) (3) The basic unit of encoding
    for the Unicode character encoding. (4) The English name for the
    ideographic written elements of Chinese origin. (See ideograph (2).)

    Character Encoding Form. Mapping from a character set definition to the
    actual code units used to represent the data.

    Character Encoding Scheme. A character encoding form plus byte
    serialization. There are seven character encoding schemes in Unicode:
    UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.

    Character Set. A collection of elements used to represent textual
    information.

    Coded Character Set. A character set in which each character is assigned
    a numeric code point. Frequently abbreviated as character set, charset,
    or code set.

    Code Point. (1) A numerical index (or position) in an encoding table
    used for encoding characters. (2) Synonym for Unicode scalar value.

    Code Unit. The minimal bit combination that can represent a unit of
    encoded text for processing or interchange. (See Definition D5 in
    Section 3.3, Characters and Coded Representations.)

    Encoded Character. An abstract character together with its associated
    Unicode scalar value (code point). By itself, an abstract character has
    no numerical value, but the process of "encoding a character" associates
    a particular Unicode scalar value with a particular abstract character,
    thereby resulting in an "encoded character."

    Encoding Form. (See character encoding form.)

    Encoding Scheme. (See character encoding scheme.)

    Font. A collection of glyphs used for the visual depiction of character
    data. A font is often associated with a set of parameters (for example,
    size, posture, weight, and serifness), which, when set to particular
    values, generate a collection of imagable glyphs.

    Glyph. (1) An abstract form that represents one or more glyph images.
    (2) A synonym for glyph image. In displaying Unicode character data, one
    or more glyphs may be selected to depict a particular character. These
    glyphs are selected by a rendering engine during composition and layout
    processing. (See also character.)

    Glyph Image. The actual, concrete image of a glyph representation having
    been rasterized or otherwise imaged onto some display surface.

    ==== End of Glossary Excerpt ====

    Note: the glossary does not contain a definition of "character
    encoding", but it seems to be used in the Unicode context as an
    abbreviation for "character encoding form" (_not_ "character encoding
    scheme").

    The Java documentation is a bit confused and inconsistent in its
    terminology with respect to characters, character sets, and character
    encodings. For instance "the Java platform uses Unicode as its native
    character encoding" (i18n docs), but at the same time UTF-8 and UTF-16
    are referred to as "character encodings" that all Java implementations
    must support. Elsewhere in the docs these are the names of "charsets",
    which is to say that in some places the docs call them "encodings" and
    in others "charsets." Moreover, in Java 1.4 there is now
    java.nio.charset.Charset, "A named mapping between sequences of
    sixteen-bit Unicode characters and sequences of bytes." This is
    consistent with MIME, but different from Unicode usage (which would call
    such a thing a "character encoding scheme", except that it also implies
    a certain collection of characters to which it can apply).
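
    For reference, here is that 1.4 class in action; a minimal sketch:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    public class CharsetDemo {
        public static void main(String[] args) {
            Charset utf8 = Charset.forName("UTF-8");
            ByteBuffer bytes = utf8.encode("caf\u00e9"); // chars -> bytes
            CharBuffer chars = utf8.decode(bytes);       // bytes -> chars
            System.out.println(chars);                   // cafe with an acute accent
        }
    }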

    RFC 2045 (MIME) says, in part:
    NOTE: The term "character set" was originally to describe such
    straightforward schemes as US-ASCII and ISO-8859-1 which have a
    simple one-to-one mapping from single octets to single characters.
    Multi-octet coded character sets and switching techniques make the
    situation more complex. For example, some communities use the term
    "character encoding" for what MIME calls a "character set", while
    using the phrase "coded character set" to denote an abstract mapping
    from integers (not octets) to characters.
    That describes terminology similar to Unicode's current terminology, but
    not exactly the same.

    Java documentation has also changed over the years with regard to how it
    names these things.

    What a muddle.

    As this is a Java newsgroup, we are stuck with the inconsistencies of
    the Java documentation, at least to some extent. We are also stuck with
    the ambiguity of some of our terms (note, for instance, the four (four!)
    different definitions of "character" above -- I find that I personally
    use the word in each of those ways). On the other hand, there seem to
    be some points on which there is little disagreement (at least in
    official documents) such as a "font" as a collection of glyphs, a "glyph"
    as a depiction of a character, and a definite distinction between glyphs
    and characters (whatever those are).

    Luckily, in a Java context there is little call for usage of the term
    "character set" in the Unicode sense, so it seems reasonable and
    appropriate that in this venue we should use it in the MIME / Java NIO
    sense (as defined by java.nio.charset.Charset). "Charset" is a synonym.

    By an "encoding" or "character encoding", I think we generally mean what
    Unicode calls a "character encoding scheme" with the implicit
    recognition that in a Java context the "character encoding form" is the
    one defined by Unicode. This is closely related to a "charset" as
    defined in the previous paragraph.

    That leaves only "character" of the five terms of interest, and I don't
    think I can do much better than the Unicode glossary there, except to
    note the existence of class java.lang.Character, a related but distinct
    entity. A character is definitely distinct from a glyph -- the latter
    is a possible representation (or part of one) of the former, for some of
    the definitions of the former.


    John Bollinger
     
    John C. Bollinger, Oct 21, 2003
    #14
