Oliver said:
I don't think it should make sense to manipulate characters as
integers, just like it doesn't make sense to manipulate Strings which
coincidentally have length 1 as integers.
At one level I agree with you; there's something unnatural about conflating
characters and integers. In fact Smalltalk works exactly how you suggest, and
my own Unicode implementation for Smalltalk (under construction) works that way
too, so I have a fair bit of experience using a system which separates the two
concepts.
But that's only half the story. You also need to be able to do a significant
subset of arithmetical operations on character values (indexing into arrays
for instance), and such operations often turn up in places where constantly
casting back-and-forth between integer code points and actual characters would
be painful and/or inefficient. Java doesn't really support the idea of
"hybrid" values -- half arithmetical, half not -- so, barring major changes to
the language, I'd stick with the current scheme, but make "char" wider.
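
To make the "indexing into arrays" case concrete, here is a minimal sketch of
the kind of code I have in mind. The class and method names are made up for
illustration; the only language feature it relies on is that a Java "char" is
promoted to "int" in arithmetic, so no casts are needed.

```java
class CharIndexDemo {
    // Counts occurrences of the letters 'a'..'z' in the text.
    // The char value is used directly as arithmetic input: (c - 'a') is an int
    // in the range 0..25, which serves as the array index.
    static int[] countLowercaseLetters(String text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLowercaseLetters("banana");
        // prints "a=3 b=1 n=2"
        System.out.println("a=" + counts[0]
                + " b=" + counts['b' - 'a']
                + " n=" + counts['n' - 'a']);
    }
}
```

If characters and integers were strictly separate types, each comparison and
the subtraction would need an explicit conversion, which is the back-and-forth
casting I'd expect to become painful in code like this.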
It's perhaps worth emphasising that, in Unicode, a character has very little
meaning by itself -- it is, in general, not possible to do anything very useful
with a character which isn't an element of a stream or string. Pretty-much the
only things you can legitimately do with a char are compare it with another
char or use it as a lookup index into Unicode character property tables. A
character is /not/ like a short string -- it's a different class of entity
entirely.
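
As a rough illustration of "use it as a lookup index into Unicode character
property tables", the sketch below passes a code point (as an int) to the
property lookups that java.lang.Character already provides. The class name and
the describe() helper are hypothetical; Character.isLetter(int) and
Character.UnicodeScript.of(int) are standard JDK calls.

```java
class CodePointProperties {
    // Looks up two Unicode properties for a single code point.
    // The code point is passed as an int, so values beyond the 16-bit
    // range of "char" are accepted as well.
    static String describe(int codePoint) {
        return String.format("U+%04X letter=%b script=%s",
                codePoint,
                Character.isLetter(codePoint),
                Character.UnicodeScript.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));     // U+0041 letter=true script=LATIN
        System.out.println(describe(0x03A9));  // U+03A9 letter=true script=GREEK
    }
}
```

Note that nothing here treats the character as a short string; it is only
ever a key into the property data.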
Tell you what. How about, since we're redefining Java anyway, we rename "char"
to "codepoint" ? It would be more accurate...
Is this "humans-only" requirement actually documented anywhere?
Not that I know of, although it wouldn't surprise me to find the human-centric
design principles discussed somewhere. Unicode includes rather a lot of
thoughtful and interesting meta-discussion in its documentation (if not in the
standard itself).
The way that Unicode works is extremely practical and /not/ universal (see
below). It introduces features only if they are used in some target
orthography. Thus it has ligatures, since they are essential in many systems
of writing. It also attempts to make round-tripping possible (from other
charsets into Unicode and back, with no information lost), and so has a very
limited number of Latin ligatures (and that's the /only/ reason it has Latin
ligatures). No writing system uses colour to denote meaning (that I know of)
and so Unicode doesn't touch colour. The result of this YAGNI-like focus on
features that are actually needed is that Unicode inevitably reflects the
human processes which create written languages, and which determine their
logical structure. One huge example is that human vision uses edge-detection
heavily. As a result Unicode glyphs are /shapes/ -- shapes which can be
rendered as black-on-white.
BTW, don't get misled by the odd few Unicode code points which are assigned to
non-visual purposes -- the BOM being a good example, or the directionality
markers. There are damned few of those, and for the most part they only exist
in order to allow round-tripping or the use of Unicode in a context where
insufficient meta-information is available, and their use is discouraged in
other contexts. Unicode is /about/ shapes.
It's worth considering how much Unicode /doesn't/ have which it might be
expected to include if the focus weren't so limited. For instance it has no
way of expressing /semantic/ qualifiers on text such as italics (or, more
abstractly, emphasis). It has no means of rendering prosody beyond the limited
expression implied by existing punctuation schemes[*]. Yet if the
text-to-speech example could be taken as a core use for Unicode -- i.e. as a
true alternative rendering of Unicode, on an equal footing with printing text
on paper -- then such annotations would seem to be highly desirable, perhaps
even necessary.
([*] Another aside: apparently English punctuation started out -- with the
Greeks, naturally -- purely as a way of expressing prosody, but at around the
time fully modern English emerged, the punctuation system had its own
mini-revolution: new marks were invented, old marks were reinterpreted or
discarded, and the role of punctuation shifted away from expressing prosody to
expressing grammar and other semantic features of text.)
I mean, if we found out that, for example, spiders encoded some
communicative information within the patterns of their webs and we
managed to decode it, would it be "against policy" to add symbols from
this spider-language to Unicode? Or would we say "well, now since we, as
humans, have decoded it, it becomes a human writing scheme, and so is apt
to be used in Unicode"?
I don't think it's a policy thing at all. If this situation were ever to
arise, then I think one of two things would happen. Either we humans (not
being able to "see" the patterns properly since we lack the necessary brain
circuitry) would develop an independent glyph-system for representing the
patterns (and whatever other features were needed). In that case the new glyph
system might get added to Unicode if enough humans wanted to represent
Spiderese texts in their discussions with other humans. Note that the spiders
themselves would probably not be able to "see" our human glyphs any more than
we could see theirs, so this system would be solely for human use. This is
roughly what has happened for musical notation[**]. Alternatively it might turn
out that human/spider brains were similar enough that we could read their
patterns directly (I have to say that I find this almost impossible to
imagine); in that case it would come down to the practicalities. Does written
Spiderese break down into a glyph system similar enough to the existing human
ones for it to be expressed in the Unicode framework ? I find this even harder
to imagine, but if it /did/ turn out that way then I see no reason for
spider-glyphs not to be added to Unicode. To me (presupposing the existence of
other intelligences at all) it seems much more likely that their communications
wouldn't have a modality which was anywhere near close enough to human writing
to fit into Unicode. Spiders, for instance, might be much more likely to use
moving patterns of standing waves in their webs (vibrations /matter/ to
spiders). Almost any species might naturally record meaning as structures in a
very-high dimensional space -- smell is far more universal on Earth than
vision.
([**] BTW, it seems to me that musical notation is in Unicode because people
want to write /about/ music, not in order to /express/ music per se.)
Your (snipped) point about Unicode assuming sequence is well-taken. Some human
written languages don't make much use of sequence. I can't remember which
off-hand, but some of the old South American languages just bung a number of
symbols/pictures together into a cartoon-like frame, and leave it to the reader
to work out which express a meaning and which qualifies what. It's an
interesting system since it allows a lot of freedom for the writer to be
creative with the pictures and layout. I don't know how such systems would be
mapped into Unicode. It'd be possible, I suppose, to write the symbols down in
an arbitrary, or conventional, order, but I don't know if that would be any use
for scholars, who might want to preserve the spatial layout. If not then
they'd probably be better off using JPEGs instead of Unicode text.
I /think/ I may have worked out where we're seeing Unicode differently.
There's a parallel with dictionaries, which come in two broad flavours. There
are the dictionaries which attempt to /record/ what the (written or not)
language is like at a given time and place (or over a range of such). The OED
is the incomparable exemplar of this school of thought. And then there are the
/prescriptive/ dictionaries -- ones which attempt to tell readers what the
"correct" meaning and spelling of a word is. In the dictionary world the
prescriptive idea has long gone out of fashion[***], and prescriptive
dictionaries are only used for teaching purposes. So, if people start --
say -- confusing "convince" and "persuade", the dictionaries will simply
reflect that in their next edition, whereas a school dictionary will attempt to
dictate that the two words have separate meanings (with a small amount of
overlap).
The parallel here is that I think you are seeing Unicode as non-prescriptive in
that sense, whereas I see it as essentially prescriptive. Its purpose -- as I
see it -- is not to /record/ the diversity of the world's scripts, but to
/standardise/ their computerised representation. The motive is purely
practical, with no scholarly side to it at all. (Although considerable
scholarship goes into creating it, and it is intended to be used /by/
scholars.) The purpose is only to allow people to share written texts across
different computers -- and for that a prescriptive approach is necessary. A
/standard/.
([***] Since about Samuel Johnson's time, although the idea does resurface from
time to time -- I believe the original Webster's Dictionary was primarily
prescriptive.)
-- chris