If you could add anything you want

Oliver Wong · May 19, 2006

Dale King said:
Not appropriate as these are motions, not symbols, but if there are
symbols that are commonly used they should be proposed.

Kent Paul Dolan brought up similar arguments.

As I was replying to a post by Bent C Dalager, it occurred to me that
Unicode does not concern itself with the representation of characters at
all, so it is perfectly feasible that Unicode could support "animated"
glyphs. My post is at
http://groups.google.ca/group/comp.lang.java.programmer/msg/cd269f6cfc8392fc

There are actually quite a few "unprintable" characters in Unicode, so
it wouldn't be a novelty to have characters that could not actually be
displayed in a traditional text editor. In fact, the codecharts have an
entire section called "Invisible Operators"
(http://www.unicode.org/charts/PDF/U2000.pdf, though in actuality, some of
the characters defined there are indeed "visible").

\u2029, for example, "Paragraph Separator", is invisible, and is there merely
to control the flow of text. It cannot, in itself, be displayed in any form.
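
For anyone who wants to poke at this from Java, here is a minimal sketch that asks the runtime for the general category of U+2029. The class and method names are made up for illustration; only the java.lang.Character calls are real API.

```java
final class InvisibleCharacterDemo {
    // U+2029 PARAGRAPH SEPARATOR, kept as a number so the source stays printable.
    static final int PS_CODE_POINT = 0x2029;

    /** True if the code point's general category is Zp (paragraph separator). */
    static boolean isParagraphSeparator(int codePoint) {
        return Character.getType(codePoint) == Character.PARAGRAPH_SEPARATOR;
    }

    public static void main(String[] args) {
        // The character has a name and a category, but no letter or digit status.
        System.out.println(Character.getName(PS_CODE_POINT));
        System.out.println("paragraph separator? " + isParagraphSeparator(PS_CODE_POINT));
        System.out.println("whitespace? " + Character.isWhitespace(PS_CODE_POINT));
        System.out.println("letter or digit? " + Character.isLetterOrDigit(PS_CODE_POINT));
    }
}
```

The point is only that the character carries properties rather than a shape; how (or whether) anything is drawn for it is left to the renderer.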

So it doesn't seem unreasonable to have some unicode character "\ufoo"
which represents an ASL gesture, which cannot be represented via some static
glyph.

- Oliver

Roedy Green · May 19, 2006

Also,
notice that Unicode _doesn't_ include fonts or font
styles, just alphabet generic glyph identifiers and
ideograph generic glyph identifiers

That is the theory, but in practice you will find multiple symbols all
looking suspiciously like an A.
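
A small sketch of the "looks like an A" situation, using three code points I am sure of: Latin A (U+0041), Greek Alpha (U+0391) and Cyrillic A (U+0410). The class name is hypothetical; Character.UnicodeScript and Character.getName are standard since Java 7.

```java
final class LookalikeLetters {
    /** Returns the Unicode script the code point is assigned to. */
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // Three capital letters that most fonts draw almost identically.
        int[] lookalikes = {0x0041, 0x0391, 0x0410};
        for (int cp : lookalikes) {
            System.out.printf("U+%04X %s (%s)%n", cp, Character.getName(cp), scriptOf(cp));
        }
    }
}
```

Each one belongs to a different script, which is the stated reason they get separate code points even though the glyphs are hard to tell apart.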

Roedy Green · May 19, 2006

Why not use the International Phonetic Alphabet, which is already
represented in Unicode?

The original question was what might people do to Unicode to expand
it, not what SHOULD they do.

There was a phonetic alphabet designed for native Canadian languages.
It is very pretty, but it is almost as bad as Hebrew for having
very similar letters you would have to look carefully at to
discriminate.

Roedy Green · May 19, 2006

But would any sane new language use a writing system like Chinese ? And, if it
did, why would anyone want to take it seriously enough to add it to Unicode.

George Bernard Shaw did something similar, with his Shavian phonetic
alphabet for English and it is supported in Unicode even though it is
only of historical interest.
..
http://www.unicode.org/charts/PDF/U10450.pdf

Roedy Green · May 19, 2006

Many of these exist for common ones.

I read somewhere they decided not to add any more ligatures. However,
in typesetting you still need a code for them, so I suspect eventually
they will be given Unicode slots.

Roedy Green · May 19, 2006

You would get into trouble if it turns out that the exact stickiness
(however stickiness is measured) of the strands involved in the symbol
are vital to the correct interpretation of the message.

More likely is decoding something of squid language which entails
rapid subtle colour changes. Presumably they compute something with
their outsized brains.

Dale King · May 20, 2006

Roedy said:
George Bernard Shaw did something similar, with his Shavian phonetic
alphabet for English and it is supported in Unicode even though it is
only of historical interest.
..
http://www.unicode.org/charts/PDF/U10450.pdf

Note that in the case of Shavian, it was not a new language but a
different alphabetic representation for existing languages. It is more
closely related to the phonetic alphabet than a new language.

And given the amount of information on the web it seems to have more
than just historical interest.

Kent Paul Dolan · May 21, 2006

Kent Paul Dolan brought up similar arguments.

As I was replying to a post by Bent C Dalager, it
occurred to me that Unicode does not concern itself
with the representation of characters at all, so
it is perfectly feasible that Unicode could
support "animated" glyphs.

There are actually quite a few "unprintable"
characters in Unicode, so it wouldn't be a novelty
to have characters that could not actually be
displayed in a traditional text editor.

So it doesn't seem unreasonable to have some
unicode character "\ufoo" which represents an ASL
gesture, which cannot be represented via some
static glyph.

That might work for the various choreography
annotations, though the "alphabets" would be rather
huge unless done, as are many ligatures, as a set of
overstrikes of simpler motions, body part by body
part, all "printed" at the same location, or strung
out as a "word of motion".

But ASL "words/gestures" have context dependent
meanings, among other difficulties in capturing them
in brief encodings.

Their meanings sometimes also depend profoundly, not
just casually, on accompanying facial expressions.

Too, a typical ASL paragraph consists of putting
actors/items at various places in space in front
of the "speaker", then play-acting interactions
among those locations and their contents.

In many senses, ASL is a much richer language than
most spoken/written languages. I'm not an expert ASL
speaker, but I get by in simple conversations, and I
just don't see a way short of full video recording
to convey an ASL conversation with the usual ASL
conventions.

That recording would need to be stereo video
recording, too, to give depth perception. ASL's
"location in space meaning" is dependent on four
dimensions. X, Y, Z, and speed of execution of the
gesture are all modifiers to that gesture's meaning.

Even how broadly the gesture is made modifies its
meaning, as may all of its starting point, its
ending point, and its curved path through space.

Surely attempting to convey such a conversation in a
written format (and to me, even though it is a
system that includes invisible codes, Unicode is a
system targeted at written languages) that would let
the reader reconstruct the entire ASL gesture
sequence or even understand its meaning from its ASL
form, would be an incredibly painful exercise.

Moreover, since a written form of ASL _has_ been
created, and failed to be adopted, I'm guessing that
there would be no particular call for a "Unicode
version of ASL" in any case.

ASLers who want to convey their ideas in written
form, at least those literate enough to be capable
of any kind of reading/writing, normally write them
in English. [ASL is specifically _American_ sign
language, and depends on the user's knowledge of
American English to convey, by spelling them out,
ideas for which no accepted sign currently exists.]

Note also that there are several formal sign
languages besides ASL (i.e., Mexican sign language,
British sign language) usually quite incompatible
among themselves, so this problem would need solving
many times, and chew up many chunks of Unicode
code-space. I cannot conceive of any value in making
the attempt to encode ASL in a code space of
Unicode, which doesn't deny that someone may make
the attempt anyway.

FWIW

xanthian.

Chris Uppal · May 21, 2006

Oliver said:
I don't think it should make sense to manipulate characters as
integers, just like it doesn't make sense to manipulate Strings which
coincidentally have length 1 as integers.

At one level I agree with you; there's something unnatural about conflating
characters and integers. In fact Smalltalk works exactly how you suggest, and
my own Unicode implementation for Smalltalk (under construction) works that way
too, so I have a fair bit of experience using a system which separates the two
concepts.

But that's only half the story. You also need to be able to do a significant
subset of arithmetical operations on character values (indexing into arrays
for instance), and such operations often turn up in places where constantly
casting back-and-forth between integer code points and actual characters would
be painful and/or inefficient. Java doesn't really support the idea of
"hybrid" values -- half arithmetical, half not -- so, barring major changes to
the language, I'd stick with the current scheme, but make "char" wider.
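
To illustrate the kind of arithmetic meant here, a minimal sketch of indexing into an array by character value. The class and method are invented for the example and deliberately restricted to ASCII letters; nothing about it is specific to any real library.

```java
final class LetterHistogram {
    /** Counts ASCII letters a-z and A-Z, folding case; all other chars are skipped. */
    static int[] countLetters(String text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;   // char is promoted to int, giving an index 0..25
            } else if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLetters("Banana");
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                // Going the other way needs an explicit cast back to char.
                System.out.println((char) ('a' + i) + ": " + counts[i]);
            }
        }
    }
}
```

With a strict separation of characters and integers, both the subtraction and the comparison would need conversions on every loop iteration, which is the pain being described.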

It's perhaps worth emphasising that, in Unicode, a character has very little
meaning by itself -- it is, in general, not possible to do anything very useful
with a character which isn't an element of a stream or string. Pretty-much the
only things you can legitimately do with a char are compare it with another
char or use it as a lookup index into Unicode character property tables. A
character is /not/ like a short string -- it's a different class of entity
entirely.

Tell you what. How about, since we're redefining Java anyway, we rename "char"
to "codepoint" ? It would be more accurate...

Is this "humans-only" requirement actually documented anywhere?

Not that I know of, although it wouldn't surprise me to find the human-centric
design principles discussed somewhere. Unicode includes rather a lot of
thoughtful and interesting meta-discussion in its documentation (if not in the
standard itself).

The way that Unicode works is extremely practical and /not/ universal (see
below). It introduces features only if they are used in some target
orthography. Thus it has ligatures, since they are essential in many systems
of writing. It also attempts to make round-tripping from other charsets, into
Unicode, and back possible (no information lost), and so has a very limited
number of Latin ligatures (and that's the /only/ reason it has Latin
ligatures). No writing system uses colour to denote meaning (that I know of)
and so Unicode doesn't touch colour. The result of this YAGNI-like focus on
features that are actually needed, is that Unicode inevitably reflects the
human processes which create written languages, and which determine their
logical structure. One huge example is that human vision uses edge-detection
heavily. As a result Unicode glyphs are /shapes/ -- shapes which can be
rendered as black-on-white.
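
As a concrete instance of the Latin ligatures mentioned above, here is a hedged sketch using U+FB04 (LATIN SMALL LIGATURE FFL) and java.text.Normalizer. The class and method names are made up; the assumption is simply that compatibility normalization (NFKC) is an acceptable way to get back to plain letters.

```java
import java.text.Normalizer;

final class LigatureRoundTrip {
    // U+FB04 LATIN SMALL LIGATURE FFL, one of the few Latin ligatures with a code point.
    static final String FFL_LIGATURE = new String(Character.toChars(0xFB04));

    /** Applies compatibility normalization, which replaces ligatures by their letters. */
    static String expandCompatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(Character.getName(0xFB04));
        System.out.println("chars before: " + FFL_LIGATURE.length());
        System.out.println("chars after:  " + expandCompatibility(FFL_LIGATURE).length());
    }
}
```

Note that the expansion is one-way: once the ligature has been replaced by three letters, nothing in the text records that it was ever a single code point.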

BTW, don't get misled by the odd few Unicode code points which are assigned to
non-visual purposes -- the BOM being a good example, or the directionality
markers. There are damned few of those, and for the most part they only exist
in order to allow round-tripping or the use of Unicode in a context where
insufficient meta-information is available, and their use is discouraged in
other contexts. Unicode is /about/ shapes.

It's worth considering how much Unicode /doesn't/ have which it might be
expected to include if the focus weren't so limited. For instance it has no
way of expressing /semantic/ qualifiers on text such as italics (or, more
abstractly, emphasis). It has no means of rendering prosody beyond the limited
expression implied by existing punctuation schemes[*]. Yet if the
text-to-speech example could be taken as a core use for Unicode -- i.e. as a
true alternative rendering of Unicode, on an equal footing with printing text
on paper -- then such annotations would seem to be highly desirable, perhaps
even necessary.

([*] Another aside: apparently English punctuation started out -- with the
Greeks, naturally -- purely as a way of expressing prosody, but at around the
time fully modern English emerged, the punctuation system had its own
mini-revolution: new marks were invented, old marks were reinterpreted or
discarded, and the role of punctuation shifted away from expressing prosody to
expressing grammar and other semantic features of text.)

I
mean, if we found out that, for example, spiders encoded some
communicative information within the patterns of their webs and we
managed to decode it, would it be "against policy" to add symbols from
this spider-language to Unicode? Or would we say "well, now since we, as
humans, have decoded it, it becomes a human writing scheme, and so is apt
to be used in Unicode"?

I don't think it's a policy thing at all. If this situation were ever to
arise, then I think one of two things would happen. Either we humans (not
being able to "see" the patterns properly since we lack the necessary brain
circuitry) would develop an independent glyph-system for representing the
patterns (and whatever other features were needed). In that case the new glyph
system might get added to Unicode if enough humans wanted to represent
Spiderese texts in their discussions with other humans. Note that the spiders
themselves would probably not be able to "see" our human glyphs any more than
we could see theirs, so this system would be solely for human use. This is
roughly what has happened for musical notation[**]. Alternatively it might turn
out that human/spider brains were similar enough that we could read their
patterns directly (I have to say that I find this almost impossible to
imagine). In that case it would come down to the practicalities. Does written
Spiderese break down into a glyph system similar enough to the existing human
ones for it to be expressed in the Unicode framework ? I find this even harder
to imagine, but if it /did/ turn out that way then I see no reason for
spider-glyphs not to be added to Unicode. To me (presupposing the existence of
other intelligences at all) it seems much more likely that their communications
wouldn't have a modality which was anywhere near close enough to human writing
to fit into Unicode. Spiders, for instance, might be much more likely to use
moving patterns of standing waves in their webs (vibrations /matter/ to
spiders). Almost any species might naturally record meaning as structures in a
very-high dimensional space -- smell is far more universal on Earth than
vision.

([**] BTW, it seems to me that musical notation is in Unicode because people
want to write /about/ music, not in order to /express/ music per se.)

Your (snipped) point about Unicode assuming sequence is well-taken. Some human
written languages don't make much use of sequence. I can't remember which
off-hand, but some of the old South American languages just bung a number of
symbols/pictures together into a cartoon-like frame, and leave it to the reader
to work out which express a meaning and which qualifies what. It's an
interesting system since it allows a lot of freedom for the writer to be
creative with the pictures and layout. I don't know how such systems would be
mapped into Unicode. It'd be possible, I suppose, to write the symbols down in
an arbitrary, or conventional, order, but I don't know if that would be any use
for scholars, who might want to preserve the spatial layout. If not then
they'd probably be better off using JPEGs instead of Unicode text.

I /think/ I may have worked out where we're seeing Unicode differently.
There's a parallel with dictionaries, which come in two broad flavours. There
are the dictionaries which attempt to /record/ what the (written or not)
language is like at a given time and place (or over a range of such). The OED
is the incomparable exemplar of this school of thought. And then there are the
/prescriptive/ dictionaries -- ones which attempt to tell readers what the
"correct" meaning and spelling of a word is. In the dictionary world the
prescriptive idea has long gone out of fashion[***], and prescriptive
dictionaries are only used for teaching purposes. So, if people start --
say -- confusing "convince" and "persuade", the dictionaries will simply
reflect that in their next edition, whereas a school dictionary will attempt to
dictate that the two words have separate meanings (with a small amount of
overlap).

The parallel here is that I think you are seeing Unicode as non-prescriptive in
that sense, whereas I see it as essentially prescriptive. Its purpose -- as I
see it -- is not to /record/ the diversity of the world's scripts, but to
/standardise/ their computerised representation. The motive is purely
practical, with no scholarly side to it at all. (Although considerable
scholarship goes into creating it, and it is intended to be used /by/
scholars.) The purpose is only to allow people to share written texts across
different computers -- and for that a prescriptive approach is necessary. A
/standard/.

([***] Since about Samuel Johnson's time, although the idea does resurface from
time to time -- I believe the original Webster's Dictionary was primarily
prescriptive.)

-- chris

Roedy Green · May 21, 2006

In many senses, ASL is a much richer language than
most spoken/written languages.

In ASL, you have the analog ability to emphasise with the grandness of
gesture and exaggeration of the facial expressions. You would not
need to encode that in an ASL symbolic dictionary. Humans are quite
capable of supplying that on their own.

If you look at an ASL dictionary it has stylised pictures with little
arrows to indicate motion. I could imagine someone inventing a
notation that could be read directly or used to generate those images
or "Reboot" style 3D animations, much as Chinese ideograms can be
created from combining radical symbols.

Oliver Wong · May 23, 2006

[Snipped long, but very interesting response -- thanks Chris]

I found Chris' reply very interesting and informative, and it inspired
me to actually go read The Unicode Standard document. I'll copy and paste
interesting block quotes later on in this post, but for the extremely
impatient, here's a bullet point summary.

* The Unicode Standard does set a limit on itself at 0x10FFFF (or just
over a million) characters. I don't know why.
* Unicode deals with abstract semantical concepts of a character, and
not with the glyph, graphic, picture or whatever you want to call it, that
is used to actually visually render that character legible.
* They specifically say that they do not wish to cover "dance
notations".
* One interesting (to me anyway, and in the context of this discussion)
character is U+2062. It's a character which is traditionally invisible
(though I suppose fonts are free to supply a graphic for it) which
represents the mathematical concept of multiplication. That is, when you
want to write the concept "A times B", you'd write the character 'A', the
U+2062 character, and the character 'B'. This is to distinguish from the
single token "AB", or the sequence 'A', ' ' (the space character), 'B'.
* If you want to submit a proposal for a character or set of characters,
but you're not sure if it's a good idea, there's a mailing list designed
specifically to discuss potential new character submissions. Maybe when I
have free time (read: probably never), I'll submit my supplemental music
characters to the mailing list.
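
The U+2062 point in the list above can be tried out with a few lines of Java. This is only a sketch; the class and the product helper are invented for the illustration.

```java
final class InvisibleTimesDemo {
    // U+2062 INVISIBLE TIMES, built from its number so the source stays readable.
    static final String INVISIBLE_TIMES = new String(Character.toChars(0x2062));

    /** Joins two operands with an explicit but normally undisplayed multiplication. */
    static String product(String left, String right) {
        return left + INVISIBLE_TIMES + right;
    }

    public static void main(String[] args) {
        String token = "AB";
        String spaced = "A B";
        String times = product("A", "B");

        // All three may look alike (or nearly so) on screen, yet none are equal.
        System.out.println(times.equals(token));
        System.out.println(times.equals(spaced));
        System.out.println(times.length());
        System.out.println(Character.getType(0x2062) == Character.FORMAT);
    }
}
```

A program that parses formulas can therefore tell "A times B" from the identifier "AB" without guessing from layout.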

Now for the blockquotes:

http://www.unicode.org/versions/Unicode4.0.0/ch01.pdf

<quote>
Note, however, that the Unicode Standard does not encode idiosyncratic,
personal, novel, or private-use characters, nor does it encode logos or
graphics. Graphologies unrelated to text, such as dance notations, are
likewise outside the scope of the Unicode Standard.
</quote>
page 2

<quote>
The Unicode Standard does not define glyph images. That is, the standard
defines how characters are interpreted, not how glyphs are rendered.
Ultimately, the software or hardware rendering engine of a computer is
responsible for the appearance of the characters on the screen. The Unicode
Standard does not specify the precise shape, size, or orientation of
on-screen characters.
</quote>
page 5

<quote>
Before preparing a proposal, sponsors should note in particular the
distinction between the terms character and glyph as defined in this
standard. Because of this distinction, graphics such as ligatures, conjunct
consonants, minor variant written forms, or abbreviations of longer forms
are generally not acceptable as Unicode characters.
</quote>
page 7

<quote>
Experience has shown that it is often helpful to discuss preliminary
proposals before submitting a detailed proposal. One open forum for such
feedback is the Unicode e-mail discussion list. Please see the Unicode Web
site for instructions on how to subscribe to the mailing list. Sponsors are
urged to send a message of inquiry or a preliminary proposal there before
formal submission. Many problems and questions can be dealt with there.
</quote>
page 7

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

<quote>
the Unicode Design Principles are introduced—ten basic principles that
convey the essence of the standard.
</quote>
page 11

<quote>
This single repertoire is intended to be universal in coverage, containing
all the characters for textual representation in all modern writing systems,
in most historic writing systems for which sufficient information is
available to enable reliable encoding of characters, and symbols used in
plain text.
</quote>
page 14

<quote>
The Unicode Standard draws a distinction between characters and glyphs.
Characters are the abstract representations of the smallest components of
written language that have semantic value. They represent primarily, but not
exclusively, the letters, punctuation, and other signs that constitute
natural language text and technical notation. Characters are represented by
code points that reside only in a memory representation, as strings in
memory, or on disk. The Unicode Standard deals only with character codes.

Glyphs represent the shapes that characters can have when they are rendered
or displayed. In contrast to characters, glyphs appear on the screen or
paper as particular representations of one or more characters. A repertoire
of glyphs makes up a font. Glyph shape and methods of identifying and
selecting glyphs are the responsibility of individual font vendors and of
appropriate standards and are not part of the Unicode Standard.
</quote>
page 15

- Oliver

Thomas Hawtin · May 24, 2006

Oliver said:
* The Unicode Standard does set a limit on itself at 0x10FFFF (or
just over a million) characters. I don't know why.

Actually it's 0x10F800 code points, in Unicode 2.0 and later. To be
representable in 16-bit form, code points 0x10000 and above are
represented by a pair of surrogate chars. The result of this is that
code points 0xd800 through 0xdfff don't exist. Nice bit of analysis work
there...
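
One way to check the count is to let Java do the subtraction from its own constants, and to look at how a supplementary character such as Shavian U+10450 is split into a surrogate pair. The class and method names are hypothetical; the Character constants and methods are standard.

```java
final class SurrogateArithmetic {
    /** Code points 0..0x10FFFF minus the surrogate range 0xD800..0xDFFF. */
    static int scalarValueCount() {
        int allCodePoints = Character.MAX_CODE_POINT + 1;                        // 0x110000
        int surrogates = Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;  // 0x800
        return allCodePoints - surrogates;
    }

    /** Returns the UTF-16 code units Java uses for the given code point. */
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        System.out.printf("usable code points: 0x%X%n", scalarValueCount());

        char[] units = utf16Units(0x10450);   // first Shavian letter
        System.out.printf("U+10450 -> %d chars: %04X %04X%n",
                units.length, (int) units[0], (int) units[1]);
        System.out.printf("back again: U+%X%n", Character.toCodePoint(units[0], units[1]));
    }
}
```

The second half shows why the surrogate range cannot hold characters of its own: those 16-bit values are reserved as the two halves of a pair.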

Tom Hawtin

Kent Paul Dolan · May 24, 2006

Oliver said:
I found Chris' reply very interesting and
informative, and it inspired me to actually go
read The Unicode Standard document. I'll copy and
paste interesting block quotes later on in this
post, but for the extremely impatient, here's a
bullet point summary.

* The Unicode Standard does set a limit on itself
at 0x10FFFF (or just over a million) characters. I
don't know why.

That seems _really_ tiny. Chinese ideographs are
rumored to number around 100,000, so I suppose
Korean and Japanese chew up about that much more
codespace, and suddenly 30% is already gone. Are
there no other "large code space" written languages
for which similar quantities of code space need to
be available?

* Unicode deals with abstract semantical
concepts of a character, and not with the
glyph, graphic, picture or whatever you want
to call it, that is used to actually visually
render that character legible.

Sort of ... more accurately, I think, is that
Unicode is all about glyphs and providing numeric
codes to correspond to those glyphs, but doesn't
care _in detail_ about the glyph _shape_, just about
the user universe's understanding that a script
glyph for "a" and an OCR glyph for "a" and a "block
outline font" glyph for "a" and a "Courier New"
glyph for "a" are all somehow instances of the
_same_ (abstract) glyph "a".

* They specifically say that they do not wish to
cover "dance notations".

Good, it probably wouldn't have worked well anyway.

* One interesting (to me anyway, and in the
context of this discussion) character is U+2062.
It's a character which is traditionally invisible
(though I suppose fonts are free to supply a
graphic for it) which represents the mathematical
concept of multiplication. That is, when you
want to write the concept "A times B", you'd write
the character 'A', the U+2062 character, and the
character 'B'. This is to distinguish from the
single token "AB", or the sequence 'A', ' ' (the
space character), 'B'.

That's "slick" indeed, and I suspect that the reason
it is "in there" is to provide a splendidly usable
hint to mathematical typesetting systems, and
perhaps as well to symbolic math manipulation
systems.

Now for the blockquotes:

<quote>
The Unicode Standard does not specify the precise
shape, size, or orientation of on-screen
characters.
</quote>
page 5

Sort of, but that doesn't mean quite as much as a
naive reading might deduce. Characters which are
graphically identical but for orientation, but have
different semantics in different orientations, may
still have different Unicode codepoints, such as the
math "grad" and "del" symbols, which are the same
except that one is the vertical inversion of the
other.

<quote>
Before preparing a proposal, sponsors should note in particular the
distinction between the terms character and glyph as defined in this
standard. Because of this distinction, graphics such as ligatures, conjunct
consonants, minor variant written forms, or abbreviations of longer forms
are generally not acceptable as Unicode characters.
</quote>
page 7

And that has so much more complexity than it seems
to have as to be probably 1) a subject of constant
controversy, and 2) to have many, many exceptions.

There are several issues involved.

1) User viewpoint.

To a grade school student of English, "ffl" as in
"snaffle", are three letters, to a typesetter, they
are a single piece of type; to a "codepoint to
glyph" system, they are an artificial intelligence
problem if coded as three codepoints but expected to
be printed as a single glyph when used in "snaffle"
but two glyphs when used in (made up word)
"stafflist". It is _much_ simpler to get from a code
for the ligature to the three individual letters for
alphabetizing, than from the three characters to the
ligature for typesetting, yet rare indeed is the
typewriter with a separately typable ligature. Lots
of the use for Unicode is specifically to support
computer-mediated glyph rendering.

2) User community.

Sometimes just what constitutes a semantic singleton
depends on who is using it. My father, Chester,
tells me that in Spanish, "c", "h", and "ch" are
three separate letters, and alphabetize separately,
though they print unmodified as to kerning and such
when "ch" is the digraph compared, say, to "cb", yet
the "ch" is not a recognizable separate entity to an
English user of the same font software. So, should
(does?) Unicode cater for a separate codepoint for
"ch" for the Hispanic uses?

[This has really silly consequences. My dad, when in
Latin America, was expected to abbreviate his name
as "Ch. V. Dolan" on the title page of his books,
for example.]

3) Practicality.

Ligatures are a bit of a mess simply because there
can be L*M*N combinatorial problems for languages
where a letter can have more than one diacritical
mark [Hebrew, e.g., if I understand correctly], so
chewing up codespace is a real problem when catering
for ligatures of the "c-cedilla" sort.

4) Political concerns.

Telling a user of a language that what that user
considers a separate letter, really doesn't deserve
a codepoint because it can be created as an
overstrike of two existing characters, may simply
offend a whole language use community who'd think
that their own opinion on the "singletonness" of
their version of their alphabet should prevail.

This "offensive to us as a people" problem is in
part why the Gregorian calendar took centuries to be
adopted in Russia, and didn't survive the process
unmodified, with an end result that Russia's
"Gregorian" calendar is the most accurate one
currently in use in the world.

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

<quote>
the Unicode Design Principles are introduced --
ten basic principles that convey the essence of
the standard.
</quote>
page 11

<quote>
The Unicode Standard deals only with character codes.
</quote>
page 15

Somehow I doubt that is true in practice. It is true
that Unicode doesn't mandate glyphs, but it isn't
true that it ignores them entirely. Issues of
whether "identical in practice" glyphs between two
ideographic languages are "identical enough in
semantics" to share a common code, or should instead
have separate codes, must come up all the time, and
the issue only exists because the glyphs resemble
one another.

Contrariwise, though, say, (and I don't know if this
is true or not) a Greek "beta" and an English "b"
glyph have the same semantics (each conveys a
mapping to a sound of the spoken language, and for
my hypothesis here, the _same_ sound), yet it is not
the case that Unicode conflates the codepoints, the
codepoints are separate because the glyphs are
unidentical in shape and in origin.

xanthian.

Oliver Wong · May 24, 2006

Kent Paul Dolan said:
Sort of, but that doesn't mean quite as much as a
naive reading might deduce. Characters which are
graphically identical but for orientation, but have
different semantics in different orientations, may
still have different Unicode codepoints, such as the
math "grad" and "del" symbols, which are the same
except that one is the vertical inversion of the
other.

I think their point of view is that if two characters have different
semantics, then they are two completely different characters. If in one
given font, the glyph for one character can be obtained by rotating,
translating, flipping, mirroring (or whatever else) the glyph for another
character, that's an issue with that particular font, and not with Unicode.

For example, they have characters which are, for historical reasons,
called "left parenthesis" and "right parenthesis" though now they would
prefer to change the names to "open parenthesis" and "close parenthesis".
The reason being that the so-called left-parenthesis may sometimes have the
glyph '(', and other times the glyph ')'. The latter would occur when the
text is written right-to-left (and thus the ')' glyph is indicating the
opening of a parenthesis, and the '(' that appears later on would be
indicating the closing of the parenthesis).
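
Java exposes the relevant character property directly, so here is a small sketch (class and method names invented for the example) that asks whether a character's glyph is expected to be mirrored in right-to-left text.

```java
final class MirroredParentheses {
    /** True if the character is flagged as mirrored in the Unicode data. */
    static boolean swapsInRightToLeft(int codePoint) {
        return Character.isMirrored(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {'(', ')', '[', 'A'};
        for (int cp : samples) {
            System.out.printf("%c %-20s mirrored=%b%n",
                    (char) cp, Character.getName(cp), swapsInRightToLeft(cp));
        }
    }
}
```

The stored character stays the same either way; only the glyph chosen by the renderer changes with the direction of the surrounding text.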

And that has so much more complexity than it seems
to have as to be probably 1) a subject of constant
controversy, and 2) to have many, many exceptions.

There are several issues involved.

[snipped the ligature stuff 'cause I know nothing about them, and so have no
further comments]

2) User community.

Sometimes just what constitutes a semantic singleton
depends on who is using it. My father, Chester,
tells me that in Spanish, "c", "h", and "ch" are
three separate letters, and alphabetize separately,
though they print unmodified as to kerning and such
when "ch" is the digraph compared, say, to "cb", yet
the "ch" is not a recognizable separate entity to an
English user of the same font software. So, should
(does?) Unicode cater for a separate codepoint for
"ch" for the Hispanic uses?

They discuss this issue in chapter 2
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

<quote>
One of the more profound challenges in designing a worldwide character
encoding stems from the fact that, for each text process, written languages
differ in what is considered a fundamental unit of text, or a text element.

For example, in traditional German orthography, the letter combination "ck"
is a text element for the process of hyphenation (where it appears as
"k-k"), but not for the process of sorting; in Spanish, the combination "ll"
may be a text element for the traditional process of sorting (where it is
sorted between "l" and "m"), but not for the process of rendering; and in
English, the letters "A" and "a" are usually distinct text elements for the
process of rendering, but generally not distinct for the process of
searching text. The text elements in a given language depend upon the
specific text process; a text element for spell-checking may have different
boundaries from a text element for sorting purposes. For example, in the
phrase "the quick brown fox", the sequence "fox" is a text element for the
purpose of spell-checking.

However, a character encoding standard provides just the fundamental units
of encoding (that is, the abstract characters), which must exist in a unique
relationship to the assigned numerical code points. Assigned characters are
the smallest interpretable units of stored text.

[...]

The design of the character encoding must provide precisely the set of
characters that allows programmers to design applications capable of
implementing a variety of text processes in the desired languages. These
characters may not map directly to any particular set of text elements that
is used by one of these processes.
</quote>
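
To make the "text element depends on the process" point concrete, here is a minimal Java sketch using java.text.BreakIterator, which offers separate iterators for user-perceived characters and for words. The class and method names (TextElements, countElements) are made up for illustration, and exactly where the boundaries fall depends on the JDK's rules.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class TextElements {
    // Counts the segments that the given iterator finds in the text.
    static int countElements(BreakIterator it, String text) {
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A' followed by U+0300 (combining grave accent): two chars in storage.
        String accented = "A\u0300";
        System.out.println("chars:    " + accented.length());
        System.out.println("elements: "
                + countElements(BreakIterator.getCharacterInstance(Locale.ROOT), accented));

        // The same API with a different process (word breaking) draws
        // different boundaries over the same kind of text.
        String phrase = "the quick brown fox";
        System.out.println("word segments: "
                + countElements(BreakIterator.getWordInstance(Locale.ROOT), phrase));
    }
}
```

The character iterator reports the accented letter as one element even though it is stored as two code points; the word iterator groups code points differently again (note that it also reports the runs of whitespace between words as segments).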

For compatibility with other encoding systems, Unicode sometimes supports
multiple ways of representing the same character. For example, to represent
the character "Latin capital letter A with grave", you could use the single
codepoint U+00C0, or the pair of codepoints U+0041 (the character 'A')
and U+0300 (the combining grave accent). The standard requires that the two
sequences be treated as equivalent, and refers to other documents which
explain the algorithm for "normalizing" a Unicode string into a particular
form (I didn't bother to seek out those documents).
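
For what it's worth, here is a small sketch of that equivalence in Java. It assumes a JDK that ships java.text.Normalizer (Java 6 or later); the class name NormalizeDemo and the helper canonicallyEqual are my own inventions, not anything from the standard.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Two strings are canonically equivalent if they are identical
    // after being normalized to the same form (NFC is used here).
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0";  // LATIN CAPITAL LETTER A WITH GRAVE
        String decomposed = "A\u0300";  // 'A' + COMBINING GRAVE ACCENT

        // A plain comparison sees two different code point sequences.
        System.out.println(precomposed.equals(decomposed));           // false
        // After normalization the two spellings compare equal.
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```

So a program that wants to compare text "the Unicode way" should normalize both sides first rather than compare the raw code points.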

- Oliver

Chris Uppal · May 24, 2006

[apologies if the threading gets screwed up, this is as much a reply to Oliver
as it is to Kent]

That seems really tiny. Chinese ideographs are
rumored to number around 100,000, so I suppose
Korean and Japanese chew up about that much more
codespace, and suddenly 30% is already gone.

The Chinese/Japanese/Korean ideographs are unified, so there are "only" about
70 thousand characters between them (in 4.0, more still in later versions)

Unification is an interesting idea: they've taken a (mostly) historical
viewpoint to judge which of the letterforms are logically "the same" in the
three writing systems. Where they are judged to be the same, even though there
may be quite striking differences in the concrete representations in the three
cultures, they are assigned one code point. Rendering them in a
locale-sensitive manner is considered to be a representation issue. (I'm not
sure whether the collation order is likewise unified -- I believe the machinery
is there to have separate collations, but I don't know whether it is used in
this case). There are also code points which are specific to one or the other
flavour of the script, or are in other scripts which don't share a history --
but that's just business as usual...

I get the impression that this unification has required rather a lot of
scholarship. And presumably lashings of diplomacy too.

Of course, unification is not only applied to the CJK ideographs; but I suspect
that an impartial observer would conclude that Latin (and related)
orthographies have been "unified" a lot less aggressively than Chinese. (Mind
you, they take up less space anyway...)

Are
there no other "large code space" written languages
for which similar quantities of code space need to
be available?

I'm no expert, but I haven't heard of any. And my guess would be that it takes
a large and stable civilisation to develop or maintain such a huge "alphabet",
and there simply hasn't been the space or time in world history for another
empire as large as the Chinese to have existed without other people noticing
;-)

Sort of ... more accurately, I think, is that
Unicode is all about glyphs and providing numeric
codes to correspond to those glyphs, but doesn't
care _in detail_ about the glyph shape, just about
the user universe's understanding that a script
glyph for "a" and an OCR glyph for "a" and a "block
outline font" glyph for "a" and a "Courier New"
glyph for "a" are all somehow instances of the
_same_ (abstract) glyph "a".

I agree. One proof of this is that the standard includes sample concrete
glyphs. If the actual shapes (at some level of abstraction) were not important
(indeed central) then
a) There'd be no point in including pictures in the standard.
b) There'd be no way to express /what/ the standard was standardising.

That's "slick" indeed, and I suspect that the reason
it is "in there" is to provide a splendidly usable
hint to mathematical typesetting systems, and
perhaps as well to symbolic math manipulation
systems.

I find it an interesting example too. I think the underlying reasoning is that
in this case the juxtaposition of the characters has a meaning above and beyond
"they are next to each other". So, on the one hand that calls for a character
all by itself, and on the other hand an explicit character may be required to
support round tripping into other representations (TeX? MathML?) which make an
explicit distinction. (Aside: is the same character used for functional
application, or is that a second kind of semantically significant
juxtaposition ?)

And that has so much more complexity than it seems
to have as to be probably 1) a subject of constant
controversy, and 2) to have many, many exceptions.

Nobody ever promised a rose-garden ;-) My own impression (as a total outsider)
is that you are correct on both counts. Still, my impression is also that they
are trying hard to "do it right".

To a grade school student of English, "ffl" as in
"snaffle", are three letters, to a typesetter, they
are a single piece of type; to a "codepoint to
glyph" system, they are an artificial intelligence
problem if coded as three codepoints but expected to
be printed as a single glyph when used in "snaffle"
but two glyphs when used in (made up word)
"stafflist". It is much simpler to get from a code
for the ligature to the three individual letters for
alphabetizing, than from the three characters to the
ligature for typesetting, yet rare indeed is the
typewriter with a separately typable ligature. Lots
of the use for Unicode is specifically to support
computer-mediated glyph rendering.

They are absolutely explicit that ligatures and other typesetting-only concepts
have no place in Unicode. In scripts where ligatures are semantically
significant, then the story is different.

There are exceptions, of course ;-) In the case of ligatures, there are some
Latin ligatures which have been added (with much gnashing of teeth, I suspect)
in order to support round-tripping with well-established charsets.
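
As a rough illustration of how those compatibility ligatures behave, the sketch below uses java.text.Normalizer (assuming Java 6 or later) to fold U+FB04, the "ffl" ligature, back into its three letters. The class name LigatureDemo and the helper foldCompat are hypothetical.

```java
import java.text.Normalizer;

public class LigatureDemo {
    // Compatibility normalization (NFKC) replaces compatibility characters,
    // such as the Latin ligatures, with their plain-letter equivalents.
    static String foldCompat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "sna" + U+FB04 (LATIN SMALL LIGATURE FFL) + "e"
        String typeset = "sna\uFB04e";

        System.out.println(typeset.length());              // 5
        System.out.println(foldCompat(typeset));           // snaffle
        System.out.println(foldCompat(typeset).length());  // 7

        // Canonical normalization (NFC) leaves the ligature alone: it is only
        // a compatibility equivalent of "ffl", not a canonical one.
        String nfc = Normalizer.normalize(typeset, Normalizer.Form.NFC);
        System.out.println(nfc.equals(typeset));           // true
    }
}
```

Going the other way (from the three letters to the ligature) is left to the font and rendering layer, which fits the view that ligatures are a typesetting matter.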

Sometimes just what constitutes a semantic singleton
depends on who is using it. My father, Chester,
tells me that in Spanish, "c", "h", and "ch" are
three separate letters, and alphabetize separately,
though they print unmodified as to kerning and such
when "ch" is the digraph compared, say, to "cb", yet
the "ch" is not a recognizable separate entity to an
English user of the same font software. So, should
(does?) Unicode cater for a separate codepoint for
"ch" for the Hispanic uses?

I'm pretty sure it doesn't, as far as assigning code-points goes. There may
well be provision for this for when collation is considered, and similar (I'd
guess there is, but I haven't looked).

The important -- I'd imagine -- question is whether Spanish readers can "see"
the c and h in ch. I assume they can (unlike, say, most English speakers who
don't see the e or t in &). If they do then Unicode would treat that as a
compound formation -- calling for attention in collation, etc, but not for a
third code point.
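
Here is a minimal sketch of how that can look in Java, where the collation rules (not the character encoding) treat "ch" and "ll" as single sorting units. The rule string is a toy one I made up for the example: it covers only lowercase a-z, omits ñ and accented letters, and is not the JDK's own Spanish tailoring. The class name TraditionalSpanishSketch is likewise hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class TraditionalSpanishSketch {
    // Toy rules: plain a-z, with the digraph "ch" ordered after "c"
    // and "ll" ordered after "l", each as a single collation element.
    static final String RULES =
            "< a < b < c < ch < d < e < f < g < h < i < j < k < l < ll < m"
          + " < n < o < p < q < r < s < t < u < v < w < x < y < z";

    static RuleBasedCollator traditional() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    public static void main(String[] args) {
        String[] words = {"dedo", "chico", "cosa", "luz", "llama", "casa"};

        String[] byCodePoint = words.clone();
        Arrays.sort(byCodePoint);             // plain char-by-char ordering
        System.out.println(Arrays.toString(byCodePoint));

        String[] byTradition = words.clone();
        Arrays.sort(byTradition, traditional());  // "ch" and "ll" as units
        System.out.println(Arrays.toString(byTradition));
    }
}
```

The strings themselves still contain an ordinary 'c' followed by an ordinary 'h'; only the comparison treats the pair as one element, so "chico" sorts after "cosa" under these rules while it sorts before it in plain code point order.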

Ligatures are a bit of a mess simply because there
can be L*M*N combinatorial problems for languages
where a letter can have more than one diacritical
mark [Hebrew, e.g., if I understand correctly], so
chewing up codespace is a real problem when catering
for ligatures of the "c-cedilla" sort.

Basically Unicode punts on this, and requires such compounds to be made of
sequences of combining characters. There are lots of examples of characters
which, considered logically, should have been handled like that, but which are
actually given a single code point. The reasons for that are probably nicely
balanced between history, the need for round-tripping, and politics. In any
case, as far as I know, every character 'of the "c-cedilla" sort' can also be
expressed as a combining form. Large chunks of the standard are about how to
derive a canonical representation of a given sequence of characters.
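
A small sketch of what "deriving a canonical representation" means in practice, again leaning on java.text.Normalizer (Java 6 or later assumed). The class name CombiningDemo and the helper decompose are made up for the example.

```java
import java.text.Normalizer;

public class CombiningDemo {
    // NFD fully decomposes precomposed characters and puts the combining
    // marks attached to each base character into a canonical order.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // U+00E7 (c with cedilla) decomposes to 'c' + U+0327 (combining cedilla).
        System.out.println(decompose("\u00E7").equals("c\u0327"));  // true

        // Two marks on one base letter, typed in different orders:
        // U+0301 (acute, above) and U+0323 (dot below).
        String acuteFirst = "a\u0301\u0323";
        String dotFirst = "a\u0323\u0301";

        System.out.println(acuteFirst.equals(dotFirst));                        // false
        System.out.println(decompose(acuteFirst).equals(decompose(dotFirst)));  // true
    }
}
```

The two marks do not interact visually (one sits above, one below), so the standard treats either typing order as the same text and normalization picks one order for both.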

(Incidentally, it's this kind of issue which makes single Unicode characters
rather meaningless when taken out of context.)

4) Political concerns.

I can't remember where, but I read that there's one script that hasn't been
added yet, even though it should have been done ages ago. The problem is that
the inhabitants of country X (or its representatives) won't swallow the idea of
their script being unified (partially) with that of country Y. The two
countries have a history of enmity. Country Y's script is already /in/
Unicode, so what the Xers are essentially asking for is have a chunk of
duplicated entries just to satisfy national pride.

As I said, I can't remember where I read that, but if there's any truth in
it at all then I suspect that the Xers are onto a loser -- nobody else on
Earth gives a damn...

I mean, we English have our share of national pride, and have not been on the
best of terms with the French (or Italians, or Germans, come to that), but no
one is asking for our own national characters for chrissake ;-)

Contrariwise, though, say, (and I don't know if this
is true or not) a Greek "beta" and an English "b"
glyph have the same semantics (each conveys a
mapping to a sound of the spoken language, and for
my hypothesis here, the same sound), yet it is not
the case that Unicode conflates the codepoints, the
codepoints are separate because the glyphs are
unidentical in shape and in origin.

And different in meaning or use. But maybe they'd have been unified anyway if
the Chinese had got there first...

-- chris

Oliver Wong · May 24, 2006

Chris Uppal said:
The Chinese/Japanese/Korean ideographs are unified, so there are "only" about
70 thousand characters between them (in 4.0, more still in later versions)
[...]

I get the impression that this unification has required rather a lot of
scholarship. And presumably lashings of diplomacy too.

Yes. The Japanese don't write their characters exactly the same way as
the Chinese do and vice versa. Some people aren't too happy that the example
glyphs are drawn the "wrong" way. To me, that's more of a font issue than a
Unicode issue (and in theory, a given codepoint could be rendered in the
"Chinese way" if the rendering-software detected that the locale were China,
and the "Japanese way" if in the Japan locale, etc.), but others argue that
these are distinct characters and should have separate codepoints
altogether. See http://en.wikipedia.org/wiki/Han_unification#Controversy

I agree. One proof of this is that the standard includes sample concrete
glyphs. If the actual shapes (at some level of abstraction) were not important
(indeed central) then
a) There'd be no point in including pictures in the standard.
b) There'd be no way to express /what/ the standard was standardising.

I think (b) is more important than (a) here. I see the example glyphs
provided in the standard as helping to facilitate understanding, NOT as
mandating a particular shape for the glyph. For example, the example glyph
for "open parenthesis" is drawn as a curve that can be described as bulging
towards the left. However, the standard itself specifies that in
right-to-left text, the software should render it as a mirrored curve
bulging towards the right instead. I.e. what you actually see is
NOT necessarily the example glyph given in the standard.
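
Java exposes that "mirrored" property directly, so here is a tiny sketch that asks which characters are flagged for mirroring in right-to-left text. The class name MirrorDemo and the helper describe are hypothetical; Character.isMirrored is the real JDK call.

```java
public class MirrorDemo {
    // Reports whether Unicode marks the character as "mirrored", meaning a
    // renderer should flip its glyph horizontally in right-to-left text.
    static String describe(char c) {
        return "'" + c + "' mirrored: " + Character.isMirrored(c);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'(', ')', '[', 'A'}) {
            System.out.println(describe(c));
        }
    }
}
```

The parentheses and bracket report true and the letter reports false, which matches the idea that the code point names a role ("opening" punctuation) rather than a fixed shape.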

[...]
(Aside: is the same character used for functional
application, or is that a second kind of semantically significant
juxtaposition ?)

Function application is U+2061. U+2063 is "invisible separator",
presumably to separate a sequence of items in a list.
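
If you want to poke at these yourself, the sketch below prints the name and general category of the invisible operators. It assumes Java 7 or later, since it uses Character.getName; the class name InvisibleOperators and the helper isFormatChar are my own.

```java
public class InvisibleOperators {
    // True if the code point is in the "format" general category (Cf),
    // i.e. it affects interpretation or layout but has no glyph of its own.
    static boolean isFormatChar(int codePoint) {
        return Character.getType(codePoint) == Character.FORMAT;
    }

    public static void main(String[] args) {
        // U+2061 through U+2063, the code points mentioned in this thread.
        for (int cp = 0x2061; cp <= 0x2063; cp++) {
            System.out.printf("U+%04X %s (format char: %b)%n",
                    cp, Character.getName(cp), isFormatChar(cp));
        }
    }
}
```

U+2062, which sits between the two I mentioned, is "invisible times", the one used for implied multiplication.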

- Oliver
