Unicode questions

Tobiah

I've been reading about Unicode today.
I'm only vaguely understanding what it is
and how it works.

Please correct my understanding where it is lacking.
Unicode is really just a database of character information,
such as the name, the Unicode block it belongs to, a possible
numeric value, etc. These points of information
are indexed by standard, never-changing numeric
indexes, so that 0x2CF might point to some
character information set that all the world
can agree on. The actual image that gets
displayed for that integer is generally
assigned and agreed upon, but it is up to the
software responding to the unicode value to define
and generate the actual image that will represent that
character.
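
For instance, Python's stdlib unicodedata module exposes a slice of that
character database directly. A rough sketch, in the Python 2 syntax used
elsewhere in this thread (the prints just look up properties, nothing more):

import unicodedata

c = unichr(0x2CF)                      # the example code point above, U+02CF
print unicodedata.name(c)              # the official character name
print unicodedata.category(c)          # its general category (letter, symbol, ...)
print unicodedata.numeric(u'\u00bd')   # 0.5 -- the "possible numeric value", here for ONE HALF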

Now for the mysterious encodings. There are UTF-{8,16,32},
which only seem to indicate what the binary representation
of the unicode code points is going to be. Then there
are 100 or so other encodings, many of which are language
specific. ASCII encoding happens to be a 1-1 mapping up
to 127, but then there are others for various languages etc.
I was thinking maybe this special case and the others were lookup
mappings, where a
particular language user could work with characters perhaps
in the range of 0-255 like we do for ASCII, but then when
decoding, to share with others, the plain unicode representation
would be shared? Why can't we just say "unicode is unicode"
and just share files the way ASCII users do? Just have a huge
ASCII-style table that everyone sticks to. Please enlighten
my vague and probably ill-formed conception of this whole thing.
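
Here's the kind of round trip I have in mind, as a rough Python 2 sketch
(assuming a single Latin-1 byte as the language-specific input):

legacy = '\xe9'                    # one byte, say from a Latin-1 encoded file
text = legacy.decode('latin-1')    # -> u'\xe9', i.e. code point U+00E9
print repr(text)                   # u'\xe9'
print text == u'\u00e9'            # True: the shared, code-point-level form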

Thanks,

Tobiah
 
Hrvoje Niksic

Tobiah said:
would be shared? Why can't we just say "unicode is unicode"
and just share files the way ASCII users do? Just have a huge
ASCII-style table that everyone sticks to.

I'm not sure that I understand you correctly, but the UCS-2 and UCS-4
encodings are that kind of thing. Many people prefer UTF-8 because of
its convenient backward compatibility with ASCII (and its space economy
when dealing with mostly-ASCII text).
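
Both points are easy to check from Python; a minimal sketch in Python 2
(the utf-32-be codec assumes Python 2.6 or later):

s = u'hello'
print s.encode('utf-8') == s.encode('ascii')   # True: UTF-8 is ASCII-compatible for ASCII text
print len(s.encode('utf-8'))                   # 5 bytes, one per character
print len(s.encode('utf-32-be'))               # 20 bytes, four per character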
 
Chris Rebert

Tobiah said:
I've been reading about Unicode today.
I'm only vaguely understanding what it is
and how it works.

Petite Abeille already pointed to Joel's excellent primer on the
subject; I can only second their endorsement of his article.
Tobiah said:
Please correct my understanding where it is lacking.
Now for the mysterious encodings. There are UTF-{8,16,32},
which only seem to indicate what the binary representation
of the unicode code points is going to be. Then there
are 100 or so other encodings, many of which are language
specific. ASCII encoding happens to be a 1-1 mapping up
to 127, but then there are others for various languages etc.
I was thinking maybe this special case and the others were lookup
mappings, where a
particular language user could work with characters perhaps
in the range of 0-255 like we do for ASCII, but then when
decoding, to share with others, the plain unicode representation
would be shared?

There is no such thing as a "plain Unicode representation". The closest
thing would be an abstract sequence of Unicode codepoints (à la
Python's `unicode` type), but this is way too abstract to be used for
sharing/interchange, because storing anything in a file or sending it
over a network ultimately involves serialization to binary, which is
not directly defined for such an abstract representation. (Indeed, this
is exactly what encodings are: mappings between abstract codepoints
and concrete binary; the problem is, there's more than one of them.)
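
Concretely, one abstract code point can be serialized to several
different byte sequences, one per encoding. A small Python 2 sketch:

s = u'\u20ac'                      # EURO SIGN, a single abstract code point
print repr(s.encode('utf-8'))      # 3 bytes: 0xE2 0x82 0xAC
print repr(s.encode('utf-16-be'))  # 2 bytes: 0x20 0xAC
print repr(s.encode('utf-32-be'))  # 4 bytes: 0x00 0x00 0x20 0xAC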

Python's `unicode` type (and analogous types in other languages) is a
nice abstraction, but at the C level it's actually using some
(implementation-defined, IIRC) encoding to represent itself in memory;
and so when you leave Python, you also leave this implicit, hidden
choice of encoding behind and must instead be quite explicit.
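
For example, writing that abstract text out to a file forces the choice
into the open. A minimal Python 2 sketch using the stdlib codecs module
(the filename here is only illustrative):

import codecs

text = u'na\u00efve caf\u00e9'                     # a unicode object; no encoding chosen yet
f = codecs.open('out.txt', 'w', encoding='utf-8')  # the choice is now explicit
f.write(text)                                      # serialized to UTF-8 bytes on disk
f.close()

f = codecs.open('out.txt', 'r', encoding='utf-8')
print f.read() == text                             # True, because both sides agreed on UTF-8
f.close()
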
Tobiah said:
Why can't we just say "unicode is unicode"
and just share files the way ASCII users do?

Because just "Unicode" itself is not a scheme for encoding characters
as a stream of binary. Unicode /does/ define many encodings, and these
encodings are such schemes; /but/ none of them is *THE* One True
Unambiguous Canonical "Unicode" encoding scheme. Hence, one must be
specific and specify "UTF-8", or "UTF-32", or whatever.
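
To make the ambiguity concrete: the very same bytes decode to different
text depending on which encoding the reader assumes. A quick Python 2 sketch:

data = '\xe2\x82\xac'              # three bytes read from somewhere
print repr(data.decode('utf-8'))   # u'\u20ac' -- one character, the euro sign
print repr(data.decode('latin-1')) # u'\xe2\x82\xac' -- three unrelated characters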

Cheers,
Chris
 
Tobiah

Chris Rebert said:
There is no such thing as a "plain Unicode representation". The closest
thing would be an abstract sequence of Unicode codepoints (à la Python's
`unicode` type), but this is way too abstract to be used for
sharing/interchange, because storing anything in a file or sending it
over a network ultimately involves serialization to binary, which is not
directly defined for such an abstract representation. (Indeed, this is
exactly what encodings are: mappings between abstract codepoints and
concrete binary; the problem is, there's more than one of them.)

OK, so the encoding is just the binary representation scheme for
a conceptual list of unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely but as far as I know, computers
really don't that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages to certain human languages?

Thanks,

Toby
 
Chris Rebert

Tobiah said:
OK, so the encoding is just the binary representation scheme for
a conceptual list of unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely but as far as I know, computers
really don't that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages to certain human languages?

UTF-8 has the virtue of being backward-compatible with ASCII.

UTF-16 has all codepoints in the Basic Multilingual Plane take up
exactly 2 bytes; all others take up 4 bytes. The Unicode people
originally thought they would only include modern scripts, so 2 bytes
would be enough to encode all characters. However, they later
broadened their scope, thus the complication of "surrogate pairs" was
introduced.

UTF-32 has *all* Unicode codepoints take up exactly 4 bytes. This
slightly simplifies processing, but wastes a lot of space for e.g.
English texts.
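
The size trade-offs are easy to see from Python (a sketch assuming
Python 2.6+; U+1D11E, MUSICAL SYMBOL G CLEF, lies outside the BMP and
so needs a surrogate pair in UTF-16):

for ch in (u'A', u'\u20ac', u'\U0001d11e'):   # ASCII, BMP, outside the BMP
    print len(ch.encode('utf-8')), len(ch.encode('utf-16-be')), len(ch.encode('utf-32-be'))
# prints "1 2 4", then "3 2 4", then "4 4 4"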

And then there are a whole bunch of national encodings defined for
backward compatibility, but they typically only encode a portion of
all the Unicode codepoints.
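
That partial coverage is exactly what a UnicodeEncodeError reports.
A quick illustration with Latin-1 (Python 2.6+ syntax):

print repr(u'caf\u00e9'.encode('latin-1'))   # fine: every character here exists in Latin-1
try:
    u'\u20ac'.encode('latin-1')              # but the euro sign does not
except UnicodeEncodeError as e:
    print e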

More info: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Cheers,
Chris
 
Terry Reedy

Tobiah said:
OK, so the encoding is just the binary representation scheme for
a conceptual list of unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely but as far as I know, computers
really don't that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages to certain human languages?

The hundred or so language-specific encodings all pre-date unicode and
are *not* unicode encodings. They are still used because of inertia and
local optimization.

There are currently about 100,000 unicode codepoints, with space for
about 1,000,000. The unicode standard specifies exactly 2 internal
representations of codepoints using either 16 or 32 bit words. The
latter uses one word per codepoint; the former usually uses one word but
has to use two for codepoints above 2**16-1. The standard also specifies
about 7 byte-oriented transfer formats: UTF-8, -16, and -32, with big- and
little-endian variations. As far as I know, these (and a few other
variations) are the only encodings that encode all unicode chars
(codepoints).
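
The byte-order variations show up in Python as distinct codec names: the
plain utf-16 codec writes a byte-order mark, while the -le/-be variants
fix the byte order and omit it. A small sketch (Python 2.6+):

s = u'\u20ac'
print repr(s.encode('utf-16'))     # 2-byte BOM followed by 2 bytes, in native byte order
print repr(s.encode('utf-16-le'))  # 2 bytes, low byte first, no BOM
print repr(s.encode('utf-16-be'))  # 2 bytes, high byte first, no BOM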
 
Steve Holden

And why do certain people insist on referring to bytes as "octets"?

Because back in the old days bytes were of varying sizes on different
architectures - indeed the DECSystem-10 and -20 had instructions that
could be parameterized as to byte size. So octet was an unambiguous term
for the (now standard) 8-bit byte.

regards
Steve
 
Seebs

And why do certain people insist on referring to bytes as "octets"?

One common reason is that there have been machines on which "bytes" were
not 8 bits. In particular, the usage of "byte" as "the smallest addressable
storage" has been rather firmly ensconced in the C spec, so people used to
that are likely aware that, on a machine where the smallest directly
addressable chunk of space is 16 bits, it's quite likely that char is 16
bits, and thus by definition a "byte" is 16 bits, and if you want an octet,
you have to extract it from a byte.

-s
 
Terry Reedy

Steve Holden said:
Because back in the old days bytes were of varying sizes on different
architectures - indeed the DECSystem-10 and -20 had instructions that
could be parameterized as to byte size. So octet was an unambiguous term
for the (now standard) 8-bit byte.

As I remember, there were machines (CDC? Burroughs?) with 6-bit
char/bytes: 26 upper-case letters, 10 digits, 24 symbols and control chars.
 
Steve Holden

Terry Reedy said:
As I remember, there were machines (CDC? Burroughs?) with 6-bit
char/bytes: 26 upper-case letters, 10 digits, 24 symbols and control chars.

Yes, and DEC used the same (?) code, calling it SIXBIT. Since their
systems had 36-bit words it packed in very nicely.

regards
Steve
 
