Lauri Alanko
I have recently written a number of posts regarding C's wide character
support. It now turns out that my investigation has been in vain:
wchar_t is useless in portable C programming, although I'm not quite
sure whether the standard or implementations are to blame for this. Most
likely both: the standard has sanctioned the implementations'
deficiencies.
I'm working on a library that deals with multilingual strings. The
library only does computation, and doesn't need very fancy I/O,
so I'm trying to avoid any unnecessary platform dependencies and make
the library as portable as possible.
One question I'm facing is what kind of representation to use for the
multilingual strings in the public API of the library. Internally, the
library reads some binary data containing UTF-8 strings, so the obvious
answer would be for the public library functions to accept and return
strings in a standard Unicode format, either UTF-8 or UTF-32.
But this is not very C-ish. Since C has standard ways to represent
multilingual strings, it's more convenient for the API to use those
standard ways rather than introducing yet another string representation
type. Or so I thought.
So I considered the options. Multibyte strings are not a viable choice,
since their encoding is locale-dependent. If the library communicated
via multibyte strings, then the locale would have to be set to something
that made it possible to represent all the strings that the library had
to deal with.
But a library cannot make requirements on the global locale: libraries
should be components that can be plugged together, and if they begin to
make any requirements on the locale, then they cannot be used together
if the requirements conflict.
I cannot understand why C still only has a global locale. C++ came up
with first-class locales ages ago, and surely nowadays everyone should
know that anything global wreaks havoc on interoperability and
re-entrancy.
So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
represents a Unicode code point, this would be just perfect. But that's
not the case on all platforms. But that's okay, I thought, as long as I
can (with some platform-dependent magic) convert between Unicode code
points and wchar_t.
On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
code point can require two wchar_t's. That's ugly (and makes <wctype.h>
useless), but not very crucial for my purposes. The important thing is
that sequences of code points can still be encoded to and from wide
_strings_. I could have lived with this.
But then I found out about the killer: on FreeBSD (and Solaris?) the
encoding used by wchar_t is locale-dependent! That is, a single wchar_t
can represent any code point supported by the current locale, but the
same wchar_t value may be used to represent different code points in
different locales. So adopting wchar_t as the representation type would
again make the capabilities of the library dependent on the current
locale, which might be constrained by other parts of the application.
(Also, the locale-dependent wchar_t encodings are quite undocumented, so
the required platform-dependent magic would be magic indeed.)
To recap: C's multibyte strings are in a locale-dependent, possibly
variable-width encoding. On Windows, the wchar_t string encoding is
variable-width, on FreeBSD and Solaris it is locale-dependent. So for
portable C code, wchar_t doesn't provide any advantages over multibyte
strings.
So screw it all, I'll just use UTF-32 like I should have from the
beginning.
Lauri