Java Newbie Question: Character Sets, Unicode, et al

BLG

Greetings!

I am studying several books on the Java programming language and have
come across references to the fact that the JDK uses Unicode as the
default character set. I am not convinced that I fully comprehend
what the author is telling me.

I understand that Unicode is a 16-bit character set and that true
ASCII is a 7-bit representation. When I look at my source files in a
hex editor, they appear to be extended ASCII 8-bit format (which I
assume is the Windows default for a text file). OK - I assume then
that the JRE uses Unicode character sets, but javac uses some 8-bit
character set. Is this correct?

But beyond that, should I even care what the character set is?
Assuming, of course, internationalization is not a priority for me.

Also, how do I determine what character set Windows is using? How do
I change character sets in Windows?

And lastly, what is the relationship between a character set and a
font?

I hope these questions aren't too off the wall. I am trying to
clarify in my mind this character set concept. In the past, my only
concern has been ASCII vs EBCDIC.

Regards!
 
Chris Smith

BLG said:
I understand that Unicode is a 16-bit character set and that true
ASCII is a 7-bit representation.

Actually, that's a more accurate statement than most people would have
made. If you want to be really picky, though, I'd make one correction:
Unicode and ASCII are both character sets. ASCII is *also* a character
encoding (which may be what you meant by representation), which Unicode
is not (instead, there are several common encodings for the Unicode
character set, including UTF-8, UTF-16LE and UTF-16BE).
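
To see the difference concretely, here is a small Java sketch (the sample
string and the choice of encodings are just for illustration) showing the
same two Unicode characters serialized under three different encodings:

    import java.io.UnsupportedEncodingException;

    public class EncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String s = "A\u00e9";  // LATIN CAPITAL LETTER A + LATIN SMALL LETTER E WITH ACUTE

            // One character set (Unicode), three different byte serializations:
            dump("UTF-8",    s.getBytes("UTF-8"));     // 41 C3 A9
            dump("UTF-16BE", s.getBytes("UTF-16BE"));  // 00 41 00 E9
            dump("UTF-16LE", s.getBytes("UTF-16LE"));  // 41 00 E9 00
        }

        private static void dump(String label, byte[] bytes) {
            System.out.print(label + ":");
            for (byte b : bytes) {
                System.out.printf(" %02X", b & 0xFF);
            }
            System.out.println();
        }
    }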
When I look at my source files in a
hex editor, they appear to be extended ASCII 8-bit format (which I
assume is the Windows default for a text file).

No. There is no such thing as "extended ASCII 8-bit format" as some
specific entity. There are, actually, quite a large number of different
8-bit character encodings, including Cp1252, ISO8859-1, ISO8859-2,
ISO8859-3, and so on and so forth, and practically all of them are
different extensions to ASCII. ASCII itself, as a character encoding,
is in fact an 8-bit encoding as well, with the high-order bit always
being set to zero.
OK - I assume then
that the JRE uses Unicode character sets, but javac uses some 8-bit
character set. Is this correct?

Not necessarily. The javac compiler uses the platform default character
encoding. What that is depends on what platform you're developing on.

More to the point, though, you're looking to the wrong place for that
answer. The javac utility is a *consumer* of your source files; it
doesn't create them. The format of your source files comes from
whatever tool you've used to write them, and I don't know what tool that
is. You're fortunate that it happens to be compatible with what javac
expects to read (which will generally happen if you stick to ASCII
characters in your source code, but can be a problem if not).
But beyond that, should I even care what the character set is?

Sure. If you expect to write robust character-based I/O code, that is
(in any language, not just Java).
Assuming, of course, internationalization is not a priority for me.

That's not necessarily relevant. Encodings vary between platforms,
languages and language configurations, and applications, among other
things. They are often specified in protocol and file format
descriptions. You don't have to be doing i18n code to care about the
definition of a character encoding.
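
As a minimal sketch of what "caring about the encoding" looks like in
practice (the file name is hypothetical and UTF-8 is chosen arbitrarily):
name the encoding at the stream boundary instead of relying on the
platform default.

    import java.io.*;

    public class ExplicitCharsetIO {
        public static void main(String[] args) throws IOException {
            File f = new File("notes.txt");  // hypothetical file name

            // new FileWriter(f) would use the platform default encoding, so the
            // bytes on disk would depend on the machine this runs on. Naming the
            // encoding explicitly makes the file's contents predictable everywhere.
            Writer out = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
            out.write("caf\u00e9\n");
            out.close();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"));
            System.out.println(in.readLine());  // café, regardless of the default encoding
            in.close();
        }
    }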
Also, how do I determine what character set Windows is using? How do
I change character sets in Windows?

Actually, I have no idea. I've never needed to do it.
And lastly, what is the relationship between a character set and a
font?

A font provides glyphs (visual appearances) for some set of characters.
The relationship, I suppose, is that if you want to reliably display
content in a certain character set, your font had better have the
appropriate glyphs for at least the common characters in that character
set. In Java, fonts map their glyphs directly to Unicode characters, so
there's no direct relationship between the smaller character sets like
ASCII and fonts.
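
A quick illustration using the standard AWT API; whether the second check
prints true depends entirely on which fonts are installed on the machine:

    import java.awt.Font;

    public class GlyphCoverage {
        public static void main(String[] args) {
            // A logical font; its real glyph coverage depends on the host system.
            Font font = new Font("Serif", Font.PLAIN, 12);

            // canDisplay reports whether the font can supply a glyph for a character.
            System.out.println(font.canDisplay('A'));        // U+0041
            System.out.println(font.canDisplay('\u3042'));   // HIRAGANA LETTER A
        }
    }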

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Michael Borgwardt

BLG said:
I understand that Unicode is a 16-bit character set and that true
ASCII is a 7-bit representation.

Actually, Unicode is not really a character set in the way ASCII is,
and it is not restricted to 16 bits.

Unicode is a standard that assigns glyphs (characters) to numeric codes.
How these codes are concretely represented as bytes is what an encoding
or charset specifies, which is what ASCII is. There are encodings where
the number of bits used varies depending on each character, like UTF-8.
There are even stateful encodings.
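
A tiny Java sketch of that variable-width property (the character choices
are arbitrary):

    import java.io.UnsupportedEncodingException;

    public class Utf8Lengths {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // One, two and three bytes respectively under UTF-8:
            System.out.println("A".getBytes("UTF-8").length);       // 1  (U+0041)
            System.out.println("\u00e9".getBytes("UTF-8").length);  // 2  (U+00E9, e acute)
            System.out.println("\u20ac".getBytes("UTF-8").length);  // 3  (U+20AC, euro sign)
        }
    }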
When I look at my source files in a
hex editor, they appear to be extended ASCII 8-bit format (which I
assume is the Windows default for a text file).

Namely Windows Codepage 1252, which is nearly the same as
ISO-8859-1, aka Latin 1, the most common encoding for western
European languages.
OK - I assume then
that the JRE uses Unicode character sets, but javac uses some 8-bit
character set. Is this correct?

Nearly. How the JRE internally represents Strings is not really
specified, but the usual way is to use 16 bits per character
in a straightforward way. javac, on the other hand, uses the
platform default encoding (unless another one is specified on the
command line), with the additional capability of reading Unicode
escape sequences (\uXXXX) in source files. The class files
themselves store string constants in (a slightly modified) UTF-8.
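
For example (a sketch; whether the first literal survives compilation
intact depends on the source file's actual encoding matching what javac
assumes):

    public class UnicodeEscapes {
        public static void main(String[] args) {
            // The escape form is pure ASCII in the source file, so it is immune
            // to any mismatch between the file's encoding and what javac expects.
            String direct  = "é";
            String escaped = "\u00e9";
            System.out.println(direct.equals(escaped));  // true, if the encodings matched
        }
    }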
But beyond that, should I even care what the character set is?
Assuming, of course, internationalization is not a priority for me.

Yes, it is still important when writing text to, or reading it from,
a file or network socket. It's quite likely that at some point
you'll use *some* non-ASCII character, and in fact it is not even
guaranteed that all encodings represent pure ASCII text identically.
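
The classic example is EBCDIC, which the original poster mentioned. A
sketch (Cp037 is one of the "extended" charsets, so it may not be present
in every JRE):

    import java.io.UnsupportedEncodingException;

    public class NotAlwaysAscii {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Even the plain letter 'A' maps to different byte values:
            System.out.printf("ISO-8859-1: %02X%n", "A".getBytes("ISO-8859-1")[0] & 0xFF); // 41
            System.out.printf("Cp037:      %02X%n", "A".getBytes("Cp037")[0] & 0xFF);      // C1
        }
    }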
Also, how do I determine what character set Windows is using?

More recent Windows versions (since Windows 2000, I think) use Unicode
internally as far as possible, but older applications that can't
handle it use a "traditional" encoding that differs between language
versions of Windows. This is the platform default encoding.
In Java it shows up as a system property, file.encoding.
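
You can check both from Java itself; a small sketch (the values printed
depend on the machine and its regional settings, and
Charset.defaultCharset() was only added in Java 5):

    import java.nio.charset.Charset;

    public class DefaultEncoding {
        public static void main(String[] args) {
            // Typically "Cp1252" / "windows-1252" on a western European Windows setup.
            System.out.println(System.getProperty("file.encoding"));
            System.out.println(Charset.defaultCharset());
        }
    }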
How do I change character sets in Windows?

There's an option in the country & language settings somewhere that
changes the default encoding used for older apps.
And lastly, what is the relationship between a character set and a
font?

An encoding defines relationships between numeric codes or byte
representations thereof and glyphs. A font defines how the glyphs
are drawn on the screen. Different abstract glyphs can be (and
sometimes are) assigned the same shape in a font, and nearly all
fonts contain only shapes for a subset of the glyphs defined in
Unicode.
 
Roedy Green

Unicode is a standard that assigns glyphs (characters) to numeric codes.
How these codes are concretely represented as bytes is what an encoding
or charset specifies, which is what ASCII is. There are encodings where
the number of bits used varies depending on each character, like UTF-8.
There are even stateful encodings.

There is only one way you can encode ASCII as bytes, but there are
several variants for encoding Unicode, with combinations of big/little-endian
byte order, marked/unmarked (with or without a byte order mark), and
8-bit/16-bit code units.

see http://mindprod.com/jgloss/encoding.html
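
The marked/unmarked distinction is easy to see in Java (a sketch using the
charset names registered in the JDK): the plain "UTF-16" charset writes a
byte order mark, while the BE/LE variants do not.

    import java.io.UnsupportedEncodingException;

    public class ByteOrderDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            dump("UTF-16",   "A".getBytes("UTF-16"));    // FE FF 00 41  (BOM + big-endian)
            dump("UTF-16BE", "A".getBytes("UTF-16BE"));  // 00 41        (unmarked)
            dump("UTF-16LE", "A".getBytes("UTF-16LE"));  // 41 00        (unmarked)
        }

        private static void dump(String label, byte[] bytes) {
            System.out.print(label + ":");
            for (byte b : bytes) {
                System.out.printf(" %02X", b & 0xFF);
            }
            System.out.println();
        }
    }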
 
Roedy Green


In what sense nope? I presume you are being picky about the precise
meanings of "encoding", "character set" and "glyph". Am I wrong in
any sense that would make a difference to anyone but a linguist?
 
Mark Thornton

Roedy said:
Unicode without a trailing number means Unicode-16, does it not? Or has
that changed?

It has changed. Unicode is the full collection of code points. There are
various encodings such as UTF-8, UTF-16, and UTF-32.
Unicode characters may be encoded at any code point from U+0000 to U+10FFFF.

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is what a Unicode implementation was up to Unicode 1.1,
*before* surrogate code points and UTF-16 were added as concepts to
Version 2.0 of the standard. This term should now be avoided.

Q: What is UTF-16?

A: Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.) Over time, and especially after the
addition of over 14,500 composite characters for compatibility with
legacy sets, it became clear that 16 bits were not sufficient for the
user community. Out of this arose UTF-16.


Q. Will UTF-16 ever be extended to more than a million characters?

A: As stated, the goal of Unicode is not to encode glyphs, but
characters. Over a million possible codes is far more than enough for
this goal. Unicode is *not* designed to encode arbitrary data. If you
wanted, for example, to give each "instance of a character on paper
throughout history" its own code, you might need trillions or
quadrillions of such codes; noble as this effort might be, you would not
use Unicode for such an encoding. No proposed extension of UTF-16 to
more than 2 surrogates has a chance of being accepted into the Unicode
Standard or ISO/IEC 10646. Furthermore, both Unicode and ISO 10646 have
policies in place that formally limit even the UTF-32 encoding form to
the integer range that can be expressed with UTF-16 (or 21 significant
bits). [MD]
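
A short Java sketch of the surrogate mechanism described above
(Character.toChars and String.codePointAt are part of the
supplementary-character API that arrived in Java 5; the code point chosen
is arbitrary):

    public class Surrogates {
        public static void main(String[] args) {
            // U+10400 lies beyond U+FFFF, so UTF-16 represents it as a surrogate pair.
            int codePoint = 0x10400;
            String s = new String(Character.toChars(codePoint));

            System.out.println(s.length());                              // 2 UTF-16 code units
            System.out.printf("%04X %04X%n",
                    (int) s.charAt(0), (int) s.charAt(1));               // D801 DC00
            System.out.println(Integer.toHexString(s.codePointAt(0)));   // 10400
        }
    }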


Mark Thornton
 
John C. Bollinger

Michael said:
Unicode is a standard that assigns glyphs (characters) to numeric codes.

Hmmm. I thought that there was a clear distinction between characters
and glyphs. Character sets map characters to numeric codes (and vice
versa), whereas fonts map glyphs to characters. There may be many
different glyphs that represent any particular character (hence the
differentiation of fonts), and in some cases a character may require
more than one glyph. A character is a logical entity, without an
inherent physical representation. Or so I thought. Am I suffering from
a longstanding confusion here?


John Bollinger
 
Roedy Green

Hmmm. I thought that there was a clear distinction between characters
and glyphs. Character sets map characters to numeric codes (and vice
versa), whereas fonts map glyphs to characters. There may be many
different glyphs that represent any particular character (hence the
differentiation of fonts), and in some cases a character may require
more than one glyph. A character is a logical entity, without an
inherent physical representation. Or so I thought. Am I suffering from
a longstanding confusion here?

We need some definitions that make clear the distinction between:
a character set,
a character,
a glyph,
a font,
an encoding.
 
John C. Bollinger

Roedy said:
We need some definitions that make clear the distinction between:
a character set,
a character,
a glyph,
a font,
an encoding.

I'll take a stab at it.

I decided to see what Unicode had to say on the matter. That seemed
relevant, but may have been a mistake. In any event, here are some
possibly relevant definitions from the Unicode 4.0 glossary
(http://www.unicode.org/glossary/):

==== From the Unicode 4.0 Glossary ====

Abstract Character. A unit of information used for the organization,
control, or representation of textual data. (See Definition D3 in
Section 3.3, Characters and Coded Representations .)

Character. (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather than
a specific shape (see also glyph), though in code tables some form of
visual representation is essential for the reader's understanding. (2)
Synonym for abstract character. (See Definition D3 in Section 3.3,
Characters and Coded Representations .) (3) The basic unit of encoding
for the Unicode character encoding. (4) The English name for the
ideographic written elements of Chinese origin. (See ideograph (2).)

Character Encoding Form. Mapping from a character set definition to the
actual code units used to represent the data.

Character Encoding Scheme. A character encoding form plus byte
serialization. There are seven character encoding schemes in Unicode:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.

Character Set. A collection of elements used to represent textual
information.

Coded Character Set. A character set in which each character is assigned
a numeric code point. Frequently abbreviated as character set, charset,
or code set.

Code Point. (1) A numerical index (or position) in an encoding table
used for encoding characters. (2) Synonym for Unicode scalar value.

Code Unit. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. (See Definition D5 in
Section 3.3, Characters and Coded Representations .)

Encoded Character. An abstract character together with its associated
Unicode scalar value (code point). By itself, an abstract character has
no numerical value, but the process of "encoding a character" associates
a particular Unicode scalar value with a particular abstract character,
thereby resulting in an "encoded character."

Encoding Form. (See character encoding form.)

Encoding Scheme. (See character encoding scheme.)

Font. A collection of glyphs used for the visual depiction of character
data. A font is often associated with a set of parameters (for example,
size, posture, weight, and serifness), which, when set to particular
values, generate a collection of imagable glyphs.

Glyph. (1) An abstract form that represents one or more glyph images.
(2) A synonym for glyph image. In displaying Unicode character data, one
or more glyphs may be selected to depict a particular character. These
glyphs are selected by a rendering engine during composition and layout
processing. (See also character.)

Glyph Image. The actual, concrete image of a glyph representation having
been rasterized or otherwise imaged onto some display surface.

==== End of Glossary Excerpt ====

Note: the glossary does not contain a definition of "character
encoding", but it seems to be used in the Unicode context as an
abbreviation for "character encoding form" (_not_ "character encoding
scheme").

The Java documentation is a bit confused and inconsistent in its
terminology with respect to characters, character sets, and character
encodings. For instance "the Java platform uses Unicode as its native
character encoding" (i18n docs), but at the same time UTF-8 and UTF-16
are referred to as "character encodings" that all Java implementations
must support. Elsewhere in the docs these are the names of "charsets",
which is to say that in some places the docs call them "encodings" and
in others "charsets." Moreover, in Java 1.4 there is now
java.nio.charset.Charset, "A named mapping between sequences of
sixteen-bit Unicode characters and sequences of bytes." This is
consistent with MIME, but different from Unicode usage (which would call
such a thing a "character encoding scheme", except that it also implies
a certain collection of characters to which it can apply).
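
For what it's worth, the 1.4 API looks like this in use (a sketch; the
charset name and sample text are chosen arbitrarily):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    public class CharsetRoundTrip {
        public static void main(String[] args) {
            // A Charset in the java.nio sense: a named mapping between sequences
            // of sixteen-bit Unicode characters and sequences of bytes.
            Charset utf8 = Charset.forName("UTF-8");

            ByteBuffer bytes = utf8.encode("caf\u00e9");  // chars -> bytes
            CharBuffer chars = utf8.decode(bytes);        // bytes -> chars

            System.out.println(chars);          // café
            System.out.println(utf8.name());    // UTF-8
        }
    }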

RFC 2045 (MIME) says, in part:
NOTE: The term "character set" was originally to describe such
straightforward schemes as US-ASCII and ISO-8859-1 which have a
simple one-to-one mapping from single octets to single characters.
Multi-octet coded character sets and switching techniques make the
situation more complex. For example, some communities use the term
"character encoding" for what MIME calls a "character set", while
using the phrase "coded character set" to denote an abstract mapping
from integers (not octets) to characters.
That describes terminology similar to Unicode's current terminology, but
not exactly the same.

Java documentation has also changed over the years with regard to how it
names these things.

What a muddle.

As this is a Java newsgroup, we are stuck with the inconsistencies of
the Java documentation, at least to some extent. We are also stuck with
the ambiguity of some of our terms (note, for instance, the four (four!)
different definitions of "character" above -- I find that I personally
use the word in each of those ways). On the other hand, there seem to
be some points on which there is little disagreement (at least in
official documents) such as a "font" as a collection of glyphs, a "glyph"
as a depiction of a character, and a definite distinction between glyphs
and characters (whatever those are).

Luckily, in a Java context there is little call for usage of the term
"character set" in the Unicode sense, so it seems reasonable and
appropriate that in this venue we should use it in the MIME / Java NIO
sense (as defined by java.nio.charset.Charset.) "Charset" is a synonym.

By an "encoding" or "character encoding", I think we generally mean what
Unicode calls a "character encoding scheme" with the implicit
recognition that in a Java context the "character encoding form" is the
one defined by Unicode. This is closely related to a "charset" as
defined in the previous paragraph.

That leaves only "character" of the five terms of interest, and I don't
think I can do much better than the Unicode glossary there, except to
note the existence of class java.lang.Character, a related but distinct
entity. A character is definitely distinct from a glyph -- the latter
is a possible representation (or part of one) of the former, for some of
the definitions of the former.


John Bollinger
 
