Unicode questions

Tobiah

I've been reading about Unicode today.
I'm only vaguely understanding what it is
and how it works.

Please correct my understanding where it is lacking.
Unicode is really just a database of character information,
such as the name, the Unicode block it belongs to, a possible
numeric value, etc. These points of information
are indexed by standard, never-changing numeric
indexes, so that 0x2CF might point to some
character information set that all the world
can agree on. The actual image that gets
displayed for that integer is generally
assigned and agreed upon, but it is up to the
software responding to the unicode value to define
and generate the actual image that will represent that
character.
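
For instance, Python's stdlib unicodedata module exposes a slice of that
character database directly. A rough sketch, in the Python 2 syntax used
elsewhere in this thread (the prints just look up properties, nothing more):

import unicodedata

c = unichr(0x2CF)                      # the example code point above, U+02CF
print unicodedata.name(c)              # the official character name
print unicodedata.category(c)          # its general category (letter, symbol, ...)
print unicodedata.numeric(u'\u00bd')   # 0.5 -- the "possible numeric value", here for ONE HALF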

Now for the mysterious encodings. There are UTF-{8,16,32},
which only seem to indicate what the binary representation
of the unicode code points is going to be. Then there
are 100 or so other encodings, many of which are language
specific. ASCII encoding happens to be a 1-1 mapping up
to 127, but then there are others for various languages etc.
I was thinking maybe this special case and the others were lookup
mappings, where a
particular language user could work with characters perhaps
in the range of 0-255 like we do for ASCII, but then when
decoding, to share with others, the plain unicode representation
would be shared? Why can't we just say "unicode is unicode"
and just share files the way ASCII users do? Just have a huge
ASCII-style table that everyone sticks to. Please enlighten
my vague and probably ill-formed conception of this whole thing.
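
Here's the kind of round trip I have in mind, as a rough Python 2 sketch
(assuming a single Latin-1 byte as the language-specific input):

legacy = '\xe9'                    # one byte, say from a Latin-1 encoded file
text = legacy.decode('latin-1')    # -> u'\xe9', i.e. code point U+00E9
print repr(text)                   # u'\xe9'
print text == u'\u00e9'            # True: the shared, code-point-level form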

Thanks,

Tobiah
 
Hrvoje Niksic

Tobiah said:
would be shared? Why can't we just say "unicode is unicode"
and just share files the way ASCII users do? Just have a huge
ASCII-style table that everyone sticks to.

I'm not sure that I understand you correctly, but the UCS-2 and UCS-4
encodings are that kind of thing. Many people prefer UTF-8 because of
its convenient backward compatibility with ASCII (and its space economy
when dealing with mostly-ASCII text).
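
Both points are easy to check from Python; a minimal sketch in Python 2
(the utf-32-be codec assumes Python 2.6 or later):

s = u'hello'
print s.encode('utf-8') == s.encode('ascii')   # True: UTF-8 is ASCII-compatible for ASCII text
print len(s.encode('utf-8'))                   # 5 bytes, one per character
print len(s.encode('utf-32-be'))               # 20 bytes, four per character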
 
Chris Rebert

Tobiah said:
I've been reading about Unicode today.
I'm only vaguely understanding what it is
and how it works.

Petite Abeille already pointed to Joel's excellent primer on the
subject; I can only second their endorsement of his article.
Tobiah said:
Please correct my understanding where it is lacking.
Now for the mysterious encodings. There are UTF-{8,16,32},
which only seem to indicate what the binary representation
of the unicode code points is going to be. Then there
are 100 or so other encodings, many of which are language
specific. ASCII encoding happens to be a 1-1 mapping up
to 127, but then there are others for various languages etc.
I was thinking maybe this special case and the others were lookup
mappings, where a
particular language user could work with characters perhaps
in the range of 0-255 like we do for ASCII, but then when
decoding, to share with others, the plain unicode representation
would be shared?

There is no such thing as a "plain Unicode representation". The closest
thing would be an abstract sequence of Unicode codepoints (à la
Python's `unicode` type), but this is way too abstract to be used for
sharing/interchange, because storing anything in a file or sending it
over a network ultimately involves serialization to binary, which is
not directly defined for such an abstract representation. (Indeed, this
is exactly what encodings are: mappings between abstract codepoints
and concrete binary; the problem is, there's more than one of them.)
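
Concretely, one abstract code point can be serialized to several
different byte sequences, one per encoding. A small Python 2 sketch:

s = u'\u20ac'                      # EURO SIGN, a single abstract code point
print repr(s.encode('utf-8'))      # 3 bytes: 0xE2 0x82 0xAC
print repr(s.encode('utf-16-be'))  # 2 bytes: 0x20 0xAC
print repr(s.encode('utf-32-be'))  # 4 bytes: 0x00 0x00 0x20 0xAC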

Python's `unicode` type (and analogous types in other languages) is a
nice abstraction, but at the C level it's actually using some
(implementation-defined, IIRC) encoding to represent itself in memory;
and so when you leave Python, you also leave this implicit, hidden
choice of encoding behind and must instead be quite explicit.
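
For example, writing that abstract text out to a file forces the choice
into the open. A minimal Python 2 sketch using the stdlib codecs module
(the filename here is only illustrative):

import codecs

text = u'na\u00efve caf\u00e9'                     # a unicode object; no encoding chosen yet
f = codecs.open('out.txt', 'w', encoding='utf-8')  # the choice is now explicit
f.write(text)                                      # serialized to UTF-8 bytes on disk
f.close()

f = codecs.open('out.txt', 'r', encoding='utf-8')
print f.read() == text                             # True, because both sides agreed on UTF-8
f.close()
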
Tobiah said:
Why can't we just say "unicode is unicode"
and just share files the way ASCII users do?

Because just "Unicode" itself is not a scheme for encoding characters
as a stream of binary. Unicode /does/ define many encodings, and these
encodings are such schemes; /but/ none of them is *THE* One True
Unambiguous Canonical "Unicode" encoding scheme. Hence, one must be
specific and specify "UTF-8", or "UTF-32", or whatever.
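
To make the ambiguity concrete: the very same bytes decode to different
text depending on which encoding the reader assumes. A quick Python 2 sketch:

data = '\xe2\x82\xac'              # three bytes read from somewhere
print repr(data.decode('utf-8'))   # u'\u20ac' -- one character, the euro sign
print repr(data.decode('latin-1')) # u'\xe2\x82\xac' -- three unrelated characters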

Cheers,
Chris
 
Tobiah

Chris Rebert said:
There is no such thing as a "plain Unicode representation". The closest
thing would be an abstract sequence of Unicode codepoints (à la Python's
`unicode` type), but this is way too abstract to be used for
sharing/interchange, because storing anything in a file or sending it
over a network ultimately involves serialization to binary, which is not
directly defined for such an abstract representation. (Indeed, this is
exactly what encodings are: mappings between abstract codepoints and
concrete binary; the problem is, there's more than one of them.)

OK, so the encoding is just the binary representation scheme for
a conceptual list of unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely but as far as I know, computers
really don't that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages to certain human languages?

Thanks,

Toby
 
Chris Rebert

Tobiah said:
OK, so the encoding is just the binary representation scheme for
a conceptual list of unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely but as far as I know, computers
really don't that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages to certain human languages?

UTF-8 has the virtue of being backward-compatible with ASCII.

UTF-16 has all codepoints in the Basic Multilingual Plane take up
exactly 2 bytes; all others take up 4 bytes. The Unicode people
originally thought they would only include modern scripts, so 2 bytes
would be enough to encode all characters. However, they later
broadened their scope, thus the complication of "surrogate pairs" was
introduced.

UTF-32 has *all* Unicode codepoints take up exactly 4 bytes. This
slightly simplifies processing, but wastes a lot of space for e.g.
English texts.
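
The size trade-offs are easy to see from Python (a sketch assuming
Python 2.6+; U+1D11E, MUSICAL SYMBOL G CLEF, lies outside the BMP and
so needs a surrogate pair in UTF-16):

for ch in (u'A', u'\u20ac', u'\U0001d11e'):   # ASCII, BMP, outside the BMP
    print len(ch.encode('utf-8')), len(ch.encode('utf-16-be')), len(ch.encode('utf-32-be'))
# prints "1 2 4", then "3 2 4", then "4 4 4"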

And then there are a whole bunch of national encodings defined for
backward compatibility, but they typically only encode a portion of
all the Unicode codepoints.
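
That partial coverage is exactly what a UnicodeEncodeError reports.
A quick illustration with Latin-1 (Python 2.6+ syntax):

print repr(u'caf\u00e9'.encode('latin-1'))   # fine: every character here exists in Latin-1
try:
    u'\u20ac'.encode('latin-1')              # but the euro sign does not
except UnicodeEncodeError as e:
    print e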

More info: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Cheers,
Chris
 
Terry Reedy

Tobiah said:
OK, so the encoding is just the binary representation scheme for
a conceptual list of unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely but as far as I know, computers
really don't that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages to certain human languages?

The hundred or so language-specific encodings all pre-date unicode and
are *not* unicode encodings. They are still used because of inertia and
local optimization.

There are currently about 100,000 unicode codepoints, with space for
about 1,000,000. The unicode standard specifies exactly 2 internal
representations of codepoints using either 16 or 32 bit words. The
latter uses one word per codepoint; the former usually uses one word but
has to use two for codepoints above 2**16-1. The standard also specifies
about 7 byte-oriented transfer formats: UTF-8, -16, and -32, with big- and
little-endian variations. As far as I know, these (and a few other
variations) are the only encodings that encode all unicode chars
(codepoints).
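
The byte-order variations show up in Python as distinct codec names: the
plain utf-16 codec writes a byte-order mark, while the -le/-be variants
fix the byte order and omit it. A small sketch (Python 2.6+):

s = u'\u20ac'
print repr(s.encode('utf-16'))     # 2-byte BOM followed by 2 bytes, in native byte order
print repr(s.encode('utf-16-le'))  # 2 bytes, low byte first, no BOM
print repr(s.encode('utf-16-be'))  # 2 bytes, high byte first, no BOM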
 
Steve Holden

And why do certain people insist on referring to bytes as "octets"?

Because back in the old days bytes were of varying sizes on different
architectures - indeed the DECSystem-10 and -20 had instructions that
could be parameterized as to byte size. So octet was an unambiguous term
for the (now standard) 8-bit byte.

regards
Steve
 
Seebs

And why do certain people insist on referring to bytes as "octets"?

One common reason is that there have been machines on which "bytes" were
not 8 bits. In particular, the usage of "byte" as "the smallest addressable
storage" has been rather firmly ensconced in the C spec, so people used to
that are likely aware that, on a machine where the smallest directly
addressable chunk of space is 16 bits, it's quite likely that char is 16
bits, and thus by definition a "byte" is 16 bits, and if you want an octet,
you have to extract it from a byte.

-s
 
Terry Reedy

Steve Holden said:
Because back in the old days bytes were of varying sizes on different
architectures - indeed the DECSystem-10 and -20 had instructions that
could be parameterized as to byte size. So octet was an unambiguous term
for the (now standard) 8-bit byte.

As I remember, there were machines (CDC? Burroughs?) with 6-bit
char/bytes: 26 upper-case letters, 10 digits, 24 symbols and control chars.
 
Steve Holden

Terry Reedy said:
As I remember, there were machines (CDC? Burroughs?) with 6-bit
char/bytes: 26 upper-case letters, 10 digits, 24 symbols and control chars.

Yes, and DEC used the same (?) code, calling it SIXBIT. Since their
systems had 36-bit words it packed in very nicely.

regards
Steve
 
