Roedy said:
We need some definitions that make clear the distinction between:
a character set,
a character
a glyph
a font
an encoding.
I'll take a stab at it.
I decided to see what Unicode had to say on the matter. That seemed
relevant, but may have been a mistake. In any event, here are some
possibly relevant definitions from the Unicode 4.0 glossary
(http://www.unicode.org/glossary/):
==== From the Unicode 4.0 Glossary ====
Abstract Character. A unit of information used for the organization,
control, or representation of textual data. (See Definition D3 in
Section 3.3, Characters and Coded Representations.)
Character. (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather than
a specific shape (see also glyph), though in code tables some form of
visual representation is essential for the reader's understanding. (2)
Synonym for abstract character. (See Definition D3 in Section 3.3,
Characters and Coded Representations.) (3) The basic unit of encoding
for the Unicode character encoding. (4) The English name for the
ideographic written elements of Chinese origin. (See ideograph (2).)
Character Encoding Form. Mapping from a character set definition to the
actual code units used to represent the data.
Character Encoding Scheme. A character encoding form plus byte
serialization. There are seven character encoding schemes in Unicode:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.
Character Set. A collection of elements used to represent textual
information.
Coded Character Set. A character set in which each character is assigned
a numeric code point. Frequently abbreviated as character set, charset,
or code set.
Code Point. (1) A numerical index (or position) in an encoding table
used for encoding characters. (2) Synonym for Unicode scalar value.
Code Unit. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. (See Definition D5 in
Section 3.3, Characters and Coded Representations.)
Encoded Character. An abstract character together with its associated
Unicode scalar value (code point). By itself, an abstract character has
no numerical value, but the process of "encoding a character" associates
a particular Unicode scalar value with a particular abstract character,
thereby resulting in an "encoded character."
Encoding Form. (See character encoding form.)
Encoding Scheme. (See character encoding scheme.)
Font. A collection of glyphs used for the visual depiction of character
data. A font is often associated with a set of parameters (for example,
size, posture, weight, and serifness), which, when set to particular
values, generate a collection of imagable glyphs.
Glyph. (1) An abstract form that represents one or more glyph images.
(2) A synonym for glyph image. In displaying Unicode character data, one
or more glyphs may be selected to depict a particular character. These
glyphs are selected by a rendering engine during composition and layout
processing. (See also character.)
Glyph Image. The actual, concrete image of a glyph representation having
been rasterized or otherwise imaged onto some display surface.
==== End of Glossary Excerpt ====
Note: the glossary does not contain a definition of "character
encoding", but it seems to be used in the Unicode context as an
abbreviation for "character encoding form" (_not_ "character encoding
scheme").
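To make the glossary's distinction between code points, code units, and
serialized bytes concrete, here is a minimal sketch in Java. It assumes
Java 7 or later (for StandardCharsets and String.codePointCount); the
class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitsDemo {
    // Count Unicode code points (encoded characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Count 16-bit UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Count the bytes produced by the UTF-8 character encoding scheme.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'A' followed by U+1D11E (MUSICAL SYMBOL G CLEF) as a surrogate pair.
        String s = "A\uD834\uDD1E";
        System.out.println("code points:       " + codePoints(s));
        System.out.println("UTF-16 code units: " + codeUnits(s));
        System.out.println("UTF-8 bytes:       " + utf8Bytes(s));
    }
}
```

For this string the three counts are 2, 3, and 5: U+1D11E needs two
code units in the UTF-16 encoding form and four bytes in UTF-8.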
The Java documentation is a bit confused and inconsistent in its
terminology with respect to characters, character sets, and character
encodings. For instance "the Java platform uses Unicode as its native
character encoding" (i18n docs), but at the same time UTF-8 and UTF-16
are referred to as "character encodings" that all Java implementations
must support. Elsewhere in the docs the very same names are called
"charsets." Moreover, Java 1.4 adds
java.nio.charset.Charset, "A named mapping between sequences of
sixteen-bit Unicode characters and sequences of bytes." This is
consistent with MIME, but different from Unicode usage (which would call
such a thing a "character encoding scheme", except that it also implies
a certain collection of characters to which it can apply).
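As a rough illustration of a Charset being both a mapping to bytes and
an implied collection of characters, consider the sketch below. It uses
only java.nio.charset calls that I believe have been present since
Java 1.4; the class and helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class CharsetDemo {
    // Map a sequence of chars to bytes using the named charset.
    static byte[] encode(String charsetName, String text) {
        ByteBuffer buf = Charset.forName(charsetName).encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Map bytes back to a sequence of chars using the named charset.
    static String decode(String charsetName, byte[] bytes) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Ask whether the charset's repertoire includes the given char.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(encode("ISO-8859-1", eAcute).length); // one octet
        System.out.println(encode("UTF-8", eAcute).length);      // two octets
        char euro = '\u20AC';     // EURO SIGN, not in ISO-8859-1
        System.out.println(canEncode("ISO-8859-1", euro));
        System.out.println(canEncode("UTF-8", euro));
    }
}
```

The same name thus answers two questions at once: which characters can
be represented, and which bytes represent them.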
RFC 2045 (MIME) says, in part:
NOTE: The term "character set" was originally to describe such
straightforward schemes as US-ASCII and ISO-8859-1 which have a
simple one-to-one mapping from single octets to single characters.
Multi-octet coded character sets and switching techniques make the
situation more complex. For example, some communities use the term
"character encoding" for what MIME calls a "character set", while
using the phrase "coded character set" to denote an abstract mapping
from integers (not octets) to characters.
That describes terminology similar to Unicode's current terminology, but
not exactly the same.
Java documentation has also changed over the years with regard to how it
names these things.
What a muddle.
As this is a Java newsgroup, we are stuck with the inconsistencies of
the Java documentation, at least to some extent. We are also stuck with
the ambiguity of some of our terms (note, for instance, the four (four!)
different definitions of "character" above -- I find that I personally
use the word in each of those ways). On the other hand, there seem to
be some points on which there is little disagreement (at least in
official documents) such as a "font" as a collection of glyphs, a "glyph"
as a depiction of a character, and a definite distinction between glyphs
and characters (whatever those are).
Luckily, in a Java context there is little call for usage of the term
"character set" in the Unicode sense, so it seems reasonable and
appropriate that in this venue we should use it in the MIME / Java NIO
sense (as defined by java.nio.charset.Charset). "Charset" is a synonym.
By an "encoding" or "character encoding", I think we generally mean what
Unicode calls a "character encoding scheme" with the implicit
recognition that in a Java context the "character encoding form" is the
one defined by Unicode. This is closely related to a "charset" as
defined in the previous paragraph.
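To show what "encoding form plus byte serialization" means in practice,
the sketch below encodes one string with three UTF-16 encoding schemes.
The form (16-bit code units) is the same each time; only the byte
serialization differs. It assumes Java 7 or later for StandardCharsets,
and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSchemeDemo {
    // Render bytes as uppercase hex so the serialized forms are easy to compare.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Serialize the text with the given encoding scheme and show the bytes.
    static String encodeHex(String text, Charset scheme) {
        return hex(text.getBytes(scheme));
    }

    public static void main(String[] args) {
        String text = "A"; // a single code unit, 0x0041
        System.out.println("UTF-16BE: " + encodeHex(text, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE: " + encodeHex(text, StandardCharsets.UTF_16LE));
        // Java's "UTF-16" encoder writes a big-endian byte-order mark first.
        System.out.println("UTF-16:   " + encodeHex(text, StandardCharsets.UTF_16));
    }
}
```

The three results are 0041, 4100, and FEFF0041 respectively, which is
why the bare name "UTF-16" does not by itself pin down the bytes.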
That leaves only "character" of the five terms of interest, and I don't
think I can do much better than the Unicode glossary there, except to
note the existence of class java.lang.Character, a related but distinct
entity. A character is definitely distinct from a glyph -- the latter
is a possible representation (or part of one) of the former, for some of
the definitions of the former.
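As an aside on how java.lang.Character differs from a "character" in
the Unicode sense, here is a small sketch. It assumes Java 5 or later,
where Character gained the int code point methods; the class and helper
names are hypothetical.

```java
class CharacterDemo {
    // Number of Java char values (UTF-16 code units) needed for one code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // True if the code point survives a round trip through its char[] form.
    static boolean roundTrips(int codePoint) {
        char[] units = Character.toChars(codePoint);
        return Character.codePointAt(units, 0) == codePoint;
    }

    public static void main(String[] args) {
        int deseret = 0x10400; // DESERET CAPITAL LETTER LONG I
        char[] units = Character.toChars(deseret);

        System.out.println(charsNeeded('A'));      // one char suffices
        System.out.println(charsNeeded(deseret));  // a surrogate pair is needed
        System.out.println(roundTrips(deseret));

        // The code point is a letter, but its first char taken alone is
        // only a surrogate code unit, not a letter.
        System.out.println(Character.isLetter(deseret));
        System.out.println(Character.isHighSurrogate(units[0]));
        System.out.println(Character.isLetter(units[0]));
    }
}
```

So a single char (or Character object) is best thought of as a code
unit, which only sometimes coincides with a whole character.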
John Bollinger