I've been reading about Unicode today.
I only vaguely understand what it is
and how it works.
Please correct my understanding where it is lacking.
Unicode is really just a database of character information,
such as the name, the Unicode block, a possible
numeric value, etc. These points of information
are indexed by standard, never-changing numeric
indexes (code points), so that 0x2CF might point to some
character information set that all the world
can agree on. The character that corresponds to the
integer is assigned and agreed upon, but it is up to the
software responding to the Unicode value to define
and generate the actual image that will represent that
character.
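
To test that part of my understanding, I poked at Python's
unicodedata module, which I assume is a view onto that same
database (the characters here are just examples I picked):

    import unicodedata

    ch = "\u02CF"  # the code point 0x2CF mentioned above
    print(unicodedata.name(ch))       # MODIFIER LETTER LOW ACUTE ACCENT
    print(unicodedata.category(ch))   # Lm (a modifier letter)

    half = "\u00BD"  # VULGAR FRACTION ONE HALF
    print(unicodedata.numeric(half))  # 0.5, its "numeric value" field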
Now for the mysterious encodings. There are UTF-{8,16,32},
which only seem to indicate what the binary representation
of the Unicode code points is going to be. Then there
are 100 or so other encodings, many of which are language
specific. ASCII happens to be a 1-to-1 mapping up
to 127, but then there are others for various languages, etc.
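
For instance, in Python (a sketch; I'm assuming the encode
method accepts these codec names the way I think it does):

    s = "héllo"

    # same code points, three different binary representations
    print(s.encode("utf-8"))   # b'h\xc3\xa9llo' - 6 bytes
    print(s.encode("utf-16"))  # BOM plus 2 bytes per character - 12 bytes
    print(s.encode("utf-32"))  # BOM plus 4 bytes per character - 24 bytes

    # a language-specific legacy encoding: one byte per character
    print(s.encode("latin-1")) # b'h\xe9llo' - 5 bytes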
I was thinking maybe this special case and the others were lookup
mappings, where a user of a particular language could work with
characters in the range 0-255, like we do for ASCII, but then when
sharing with others, the text would be decoded to the plain Unicode
representation?
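
Here's a sketch of what I mean in Python (assuming I have the
codec names right): the same byte reads as a different character
under different legacy encodings, but decodes to an unambiguous
code point either way:

    raw = b"\xe9"  # one byte; no meaning without an encoding

    print(raw.decode("latin-1"))  # 'é' - the Western European reading
    print(raw.decode("cp1251"))   # 'й' - the Cyrillic reading

    # once decoded, each is an unambiguous Unicode code point
    print(hex(ord(raw.decode("latin-1"))))  # 0xe9
    print(hex(ord(raw.decode("cp1251"))))   # 0x439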
Why can't we just say "unicode is unicode"
and share files the way ASCII users do? Just have a huge
ASCII-style table that everyone sticks to. Please enlighten
my vague and probably ill-formed conception of this whole thing.
Thanks,
Tobiah