JKop said:
I'm writing a program that will use Unicode. To represent a
Unicode character, I need a data type that can be set to
65,536 distinct possible values, which in today's world of
computing equates to 16 bits. A bit of mathematical thought
will convince you that you need at least 16 Binary digITs to
represent 2^16 values.
There were more than 90,000 assigned Unicode characters last
time I looked (there are probably more now).
If you use a 16-bit type to store these, you have to either:
- Ignore characters whose code point is > 65535, or
- Use a multi-byte encoding such as UTF-16, and then all of
your I/O functions will have to be UTF-16 aware.
wchar_t is the natural choice, but is there any guarantee
in the standard that it'll be 16 bits?
No, in fact it's unlikely to be 16 bits. It's only guaranteed to be
able to support "the largest character in all supported locales",
and locales are implementation-dependent, so it could be 8-bit on
a system with no Unicode support.
On MS Windows, some compilers (e.g. gcc) have a 32-bit wchar_t and
some (e.g. Borland, Microsoft) have a 16-bit one. On all other systems
that I've encountered, it is 32-bit.
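If you need to know at compile time which flavour you've been given,
WCHAR_MAX from <cwchar> will usually tell you. A minimal sketch
(assuming the implementation defines WCHAR_MAX so that the
preprocessor can evaluate it, which the common ones do):

    #include <cwchar>   // WCHAR_MAX

    // Which flavour of wchar_t do we have?  0x10FFFF is the
    // highest Unicode code point.
    #if WCHAR_MAX >= 0x10FFFF
        // 32-bit wchar_t: one wchar_t per code point (gcc, most Unix systems).
    #else
        // 16-bit wchar_t: wide strings are effectively UTF-16 (Microsoft, Borland).
    #endif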
This is quite annoying (for people whose language falls in the
over-65535 region especially). One can only hope that MS will
eventually come to their senses, or perhaps that someone will
standardise a system of locales.
If you want to write something that's portable to all Unicode
platforms, you will have to use UTF-16 for your strings,
unfortunately. This means you can't use all the standard library
algorithms on them. Define a type "utf16_t" which is an unsigned short.
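Something along these lines (utf16_string is just a name I'm
suggesting):

    #include <vector>

    // A UTF-16 code unit. unsigned short is guaranteed to be at
    // least 16 bits, which is exactly what a UTF-16 code unit needs.
    typedef unsigned short utf16_t;

    // A UTF-16 string is then a plain sequence of code units. You
    // could use std::basic_string<utf16_t> instead, but then you have
    // to supply your own char_traits specialisation.
    typedef std::vector<utf16_t> utf16_string;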
The only other alternative is to use wchar_t and decode UTF-16
to plain wchar_t (ignoring any characters outside the range of
your wchar_t) whenever you receive a string encoded as
UTF-16 (and don't write code that's meant to be used by Chinese
speakers).
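A minimal sketch of that decoding, which simply drops anything
outside the Basic Multilingual Plane (the function and type names
here are my own):

    #include <string>
    #include <vector>

    typedef unsigned short utf16_t;

    // Decode UTF-16 code units into a wchar_t string, discarding any
    // character that doesn't fit in a single 16-bit unit (lossy, but
    // portable to 16-bit wchar_t platforms).
    std::wstring decode_utf16(const std::vector<utf16_t>& in)
    {
        std::wstring out;
        for (std::vector<utf16_t>::size_type i = 0; i < in.size(); ++i) {
            utf16_t u = in[i];
            if (u >= 0xD800 && u <= 0xDBFF) {
                // High surrogate: skip it, and its low-surrogate partner too.
                if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
                    ++i;
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                // Stray low surrogate: drop it.
            } else {
                out += static_cast<wchar_t>(u);
            }
        }
        return out;
    }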
This might sound a bit odd, but... if an unsigned short
must be at least 16 bits, then does that *necessarily* mean
that it:
A) Must be able to hold 65,536 distinct values.
B) And be able to store integers in the range 0 -> 65,535 ?
Yes (the standard requires USHRT_MAX to be at least 65535, so both hold).
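You can verify this on your own implementation with <climits>. A
sketch using the old negative-array-size trick, since this predates
static_assert (the typedef name is arbitrary):

    #include <climits>

    // Compiles only if unsigned short really spans 0..65535;
    // a negative array size is a compile-time error.
    typedef char ushort_holds_65535[USHRT_MAX >= 65535 ? 1 : -1];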
Furthermore, does a signed short int have to be able to
hold a value between:
A) -32,767 -> 32,768
B) -32,768 -> 32,767
No
I've also heard that some systems are stupid enough (oops!
I mean poorly enough designed) to have two values for zero,
resulting in:
-32,767 -> 32,767
Yes (these are all archaic though; from a practical point of
view you can assume 2's complement, i.e. -32768 to 32767).
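To see what your own implementation actually gives you:

    #include <climits>
    #include <iostream>

    int main()
    {
        // Guaranteed: SHRT_MIN <= -32767 and SHRT_MAX >= 32767. On a
        // 2's complement machine you'll typically see -32768 and 32767.
        std::cout << "short range: " << SHRT_MIN << " .. " << SHRT_MAX << '\n';
        return 0;
    }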
FWIW the 3 supported systems are (for x > 0):
2's complement: -x == ~x + 1
1's complement: -x == ~x
sign-magnitude: -x == x | (the sign bit)
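On a 2's complement machine you can watch the first identity hold
(the other two can only be described, since you're unlikely to find
hardware to run them on):

    #include <iostream>

    int main()
    {
        // On 2's complement, -x == ~x + 1 for every representable x.
        for (int x = 1; x <= 3; ++x)
            std::cout << -x << " == " << (~x + 1) << '\n';
        return 0;
    }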