wchar_t is useless

Discussion in 'C Programming' started by Lauri Alanko, Nov 21, 2011.

  1. Lauri Alanko

    Lauri Alanko Guest

    I have recently written a number of posts regarding C's wide character
    support. It now turns out that my investigation has been in vain:
    wchar_t is useless in portable C programming, although I'm not quite
    sure whether the standard or implementations are to blame for this. Most
    likely both: the standard has sanctioned the implementations'
    deficiencies.

    I'm working on a library that deals with multilingual strings. The
    library only does computation and doesn't need very fancy I/O, so
    I'm trying to avoid unnecessary platform dependencies and make the
    library as portable as possible.

    One question I'm facing is what kind of representation to use for the
    multilingual strings in the public API of the library. Internally, the
    library reads some binary data containing UTF-8 strings, so the obvious
    answer would be for the public library functions to accept and return
    strings in a standard unicode format, either UTF-8 or UTF-32.

    But this is not very C-ish. Since C has standard ways to represent
    multilingual strings, it's more convenient for the API to use those
    standard ways rather than introducing yet another string representation
    type. I thought.

    So I considered the options. Multibyte strings are not a viable choice,
    since their encoding is locale-dependent. If the library communicated
    via multibyte strings, then the locale would have to be set to something
    that made it possible to represent all the strings that the library had
    to deal with.

    But a library cannot make requirements on the global locale: libraries
    should be components that can be plugged together, and if they begin to
    make any requirements on the locale, then they cannot be used together
    if the requirements conflict.

    I cannot understand why C still has only a global locale. C++ came up
    with first-class locales ages ago, and surely by now everyone should
    know that anything global wreaks havoc on interoperability and
    re-entrancy.

    So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
    represents a unicode code point, this would be just perfect. But that's
    not the case on all platforms. But that's okay, I thought, as long as I
    can (with some platform-dependent magic) convert between unicode code
    points and wchar_t.

    On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
    code point can require two wchar_t's. That's ugly (and makes <wctype.h>
    useless), but not very crucial for my purposes. The important thing is
    that sequences of code points can still be encoded to and from wide
    _strings_. I could have lived with this.
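
    (For what it's worth, the conversion between a code point and a
    UTF-16 pair is at least mechanical. A minimal sketch, using unsigned
    short to stand in for Windows' 16-bit wchar_t:)

        #include <stdio.h>

        /* Encode one Unicode code point as UTF-16 code units, the
           representation Windows uses for wchar_t strings.  Returns the
           number of units written (1 or 2). */
        static int to_utf16(unsigned long cp, unsigned short out[2])
        {
            if (cp < 0x10000) {
                out[0] = (unsigned short) cp;
                return 1;
            }
            cp -= 0x10000;
            out[0] = (unsigned short) (0xD800 | (cp >> 10));   /* high surrogate */
            out[1] = (unsigned short) (0xDC00 | (cp & 0x3FF)); /* low surrogate */
            return 2;
        }

        int main(void)
        {
            unsigned short u[2];
            int n = to_utf16(0x1D11E, u); /* U+1D11E MUSICAL SYMBOL G CLEF */
            printf("%d units: %04X %04X\n", n,
                   (unsigned) u[0], (unsigned) (n == 2 ? u[1] : 0));
            return 0;
        }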

    But then I found out about the killer: on FreeBSD (and Solaris?) the
    encoding used by wchar_t is locale-dependent! That is, a single wchar_t
    can represent any code point supported by the current locale, but the
    same wchar_t value may be used to represent different code points in
    different locales. So adopting wchar_t as the representation type would
    again make the capabilities of the library dependent on the current
    locale, which might be constrained by other parts of the application.
    (Also, the locale-dependent wchar_t encodings are quite undocumented, so
    the required platform-dependent magic would be magic indeed.)

    To recap: C's multibyte strings are in a locale-dependent, possibly
    variable-width encoding. On Windows, the wchar_t string encoding is
    variable-width, on FreeBSD and Solaris it is locale-dependent. So for
    portable C code, wchar_t doesn't provide any advantages over multibyte
    strings.

    So screw it all, I'll just use UTF-32 like I should have from the
    beginning.
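
    (Concretely, the public API would then look something like the
    following sketch; the names here are hypothetical:)

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One Unicode code point per element; strings are 0-terminated
           arrays of these, by analogy with char strings. */
        typedef uint32_t ml_char;

        /* Length in code points (hypothetical API function). */
        static size_t ml_strlen(const ml_char *s)
        {
            size_t n = 0;
            while (s[n] != 0)
                n++;
            return n;
        }

        int main(void)
        {
            ml_char love[] = { 0x611B, 0 }; /* U+611B, as an example */
            printf("%u\n", (unsigned) ml_strlen(love));
            return 0;
        }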


    Lauri
    Lauri Alanko, Nov 21, 2011
    #1

  2. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-21, Lauri Alanko <> wrote:
    > I have recently written a number of posts regarding C's wide character
    > support. It now turns out that my investigation has been in vain:
    > wchar_t is useless in portable C programming, although I'm not quite


    That is false; it is useful.

    > So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
    > represents a unicode code point, this would be just perfect. But that's
    > not the case on all platforms. But that's okay, I thought, as long as I
    > can (with some platform-dependent magic) convert between unicode code
    > points and wchar_t.


    wchar_t is an integral type which represents an integral value. It does not
    represent a code point any more than "char" represents an ASCII value.

    > On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
    > code point can require two wchar_t's. That's ugly (and makes <wctype.h>


    This is a limitation of Windows. The Windows API uses 16-bit wide
    characters, so you can't get away from this no matter what language you
    write in on Windows.

    Redmond has decided that characters outside of the Unicode BMP (basic
    multilingual plane) are unimportant for its user base. So, if your
    program has customers who are Windows users, you can safely assume that
    they have already swallowed this pill.

    You get a lot of internationalization mileage out of the BMP. Actually all the
    mileage. Above U+FFFF there is only academic crap. Anyone who cares about those
    characters is likely also going to be some kind of "freetard" who won't pay a
    dime for software.

    > But then I found out about the killer: on FreeBSD (and Solaris?) the
    > encoding used by wchar_t is locale-dependent!


    I would expect this "locale dependent" to mean that if, say, a Japanese
    user is working with Shift-JIS files, then he or she can set that up in
    the locale such that when these files are processed by your program, the
    characters being read and written map to sane values of wchar_t (where
    sane == based on Unicode!).

    wchar_t does not have an encoding; it's an integral type. The encoding
    of wchar_t is binary enumeration: 000...0101 encodes 5, etc.

    Do you have some quotes from FreeBSD or Solaris documentation on this matter
    that are giving you concern? Post them.

    > So screw it all, I'll just use UTF-32 like I should have from the
    > beginning.


    But that just means you have to write your own library instead of just
    using C95 functions like wcsspn, wcscpy, etc. What if you want to do
    printf-like formatting to a wide string? Can't use swprintf.

    Here is a better idea: just use wchar_t, forget about U+1XXXX on Windows
    because Microsoft has decided that one for your users already, and if
    locale-dependent streams give you an allergic reaction, handle your own
    decoding/encoding for doing I/O.
    Kaz Kylheku, Nov 21, 2011
    #2

  3. James Kuyper

    James Kuyper Guest

    On 11/21/2011 11:17 AM, Kaz Kylheku wrote:
    ....
    > You get a lot of internationalization mileage out of the BMP. Actually all the
    > mileage. Above U+FFFF there is only academic crap.


    Academics have a need for software support too. One of my old friends
    has a husband who works mainly with dead languages; when I met him in
    1990 he could read and write 14 of them; he's probably added more since
    then. I suspect he would find software that supported Plane 1 useful.

    The sources I checked didn't give any indication of how adequate the
    BMP characters are for representing Chinese text. If the Unified Han
    Ideographs in Plane 2 are in fact needed for some purpose, there's a
    very large number of Chinese who would need them. That's hardly an
    academic issue.

    > ... Anyone who cares about those
    > characters is likely also going to be some kind of "freetard" who won't pay a
    > dime for software.


    As a professional software developer myself, I fully agree with the idea
    of paying people for their work. However, why should anyone buy software
    when they have free software available that is of acceptable quality,
    containing all the features they require?

    The only reason I can make money developing software is that there's no
    one willing to give away software that does what mine does, and that's
    just the way it should be.
    James Kuyper, Nov 21, 2011
    #3
  4. Ben Pfaff

    Ben Pfaff Guest

    Lauri Alanko <> writes:

    > To recap: C's multibyte strings are in a locale-dependent, possibly
    > variable-width encoding. On Windows, the wchar_t string encoding is
    > variable-width, on FreeBSD and Solaris it is locale-dependent. So for
    > portable C code, wchar_t doesn't provide any advantages over multibyte
    > strings.


    I agree with you.

    The libunistring manual has a section that says pretty much what
    you did in your message, by the way:
    http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
    --
    char a[]="\n .CJacehknorstu";int putchar(int);int main(void){unsigned long b[]
    ={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x11f6},*p
    =b,i=24;for(;p+=!*p;*p/=4)switch(0[p]&3)case 0:{return 0;for(p--;i--;i--)case+
    2:{i++;if(i)break;else default:continue;if(0)case 1:putchar(a[i&15]);break;}}}
    Ben Pfaff, Nov 21, 2011
    #4
  5. Jack McCue

    Jack McCue Guest

    Ben Pfaff <> wrote:
    > Lauri Alanko <> writes:
    >
    >> To recap: C's multibyte strings are in a locale-dependent, possibly

    <snip>
    >
    > I agree with you.

    ditto
    >
    > The libunistring manual has a section that says pretty much what
    > you did in your message, by the way:
    > http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess


    Thanks for the URL. I struggled with wchar_t
    on AIX for a bit and then ended up writing a set of
    small functions; my needs were simple at the time.
    At least now I know why I had a hard time; I thought I was
    missing something. Maybe I still am :)

    Regards,
    Jack
    Jack McCue, Nov 21, 2011
    #5
  6. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-21, Ben Pfaff <> wrote:
    > Lauri Alanko <> writes:
    >
    >> To recap: C's multibyte strings are in a locale-dependent, possibly
    >> variable-width encoding. On Windows, the wchar_t string encoding is
    >> variable-width, on FreeBSD and Solaris it is locale-dependent. So for
    >> portable C code, wchar_t doesn't provide any advantages over multibyte
    >> strings.

    >
    > I agree with you.
    >
    > The libunistring manual has a section that says pretty much what
    > you did in your message, by the way:
    > http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess


    It probably pretty much says the same thing, because quite likely that text is
    the source for Lauri's opinion, or both have some other common source. For
    instance, look at this:

    ``On Solaris and FreeBSD, the wchar_t encoding is locale dependent and undocumented.''

    Eerie similarity!

    I don't agree with this libunistring manual. The wchar_t type is useful and
    just fine.

    They are right about the limitation of Windows, but nobody ever went
    wrong in accepting the limitations of Microsoft Windows in order to
    write software for users of Windows who have also accepted those
    limitations.

    If you want to do processing with rare languages on Windows, install a virtual
    machine running GNU/Linux and you have 32 bit wchar_t. GNU/Linux is more
    likely than Redmond to have fonts to display your rare languages, too.

    Clearly the libunistring authors don't understand what Solaris and
    FreeBSD mean by "encoding" (and they do not care whether they are right
    or wrong because, after all, they have a library which will fix the
    FreeBSD or Solaris problem, regardless of whether it is real or
    imagined). Hey, a user who needlessly uses your library is still a user!

    And undocumented, by the way? Uh, use the source, Luke?

    Oh, and The Single Unix Specification, Issue 6, says this about wchar_t:

    wchar_t

    Integer type whose range of values can represent distinct wide-character
    codes for all members of the largest character set specified among the
    locales supported by the compilation environment: the null character has
    the code value 0 and each member of the portable character set has a code
    value equal to its value when used as the lone character in an integer
    character constant.

    I very much doubt that FreeBSD and Solaris go against the grain on this one
    in any way.
    Kaz Kylheku, Nov 21, 2011
    #6
  7. Lauri Alanko

    Lauri Alanko Guest

    In article <>,
    Kaz Kylheku <> wrote:
    > On 2011-11-21, Ben Pfaff <> wrote:
    > http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
    >
    > It probably pretty much says the same thing, because quite likely that text is
    > the source for Lauri's opinion, or both have some other common source.


    That is indeed where I learned about the locale-dependency of wchar_t.
    I found it hard to believe myself, so I checked.

    http://svnweb.freebsd.org/base/head/lib/libc/locale/euc.c?revision=172619&view=markup

    Here we have the following:

        199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
        200                 if (*s == '\0') {
        201                         errno = EILSEQ;
        202                         return ((size_t)-1);
        203                 }
        204                 wc = (wc << 8) | (unsigned char)*s++;
        205         }

    That is, in the EUC locale, the wchar_t value of a character consists
    of just the bits of the variable-width encoding of that character in
    EUC. From quick perusing of the source, other variable-width encodings
    seem to work the same way, except for utf8.c, which decodes the code
    point and stores that in wchar_t.

    As for solaris, I tried it out:


    $ uname -a
    SunOS kruuna.helsinki.fi 5.10 Generic_127111-05 sun4u sparc
    $ cat wc.c
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>

    int main(int argc, char* argv[]) {
        setlocale(LC_CTYPE, argv[1]);
        wchar_t wc = fgetwc(stdin);
        printf("%08lx\n", (unsigned long) wc);
        return 0;
    }
    $ echo -e '\xa4' | ./wc fi_FI.ISO8859-1 # U+00A4 CURRENCY SIGN
    30000024
    $ echo -e '\xa4' | ./wc fi_FI.ISO8859-15 # U+20AC EURO SIGN
    30000024
    $ echo -e '\xa4' | iconv -f iso-8859-1 -t utf-8 | ./wc fi_FI.UTF-8
    000000a4
    $ echo -e '\xa4' | iconv -f iso-8859-15 -t utf-8 | ./wc fi_FI.UTF-8
    000020ac


    Frankly, I cannot understand how platforms like these could support
    C1X where wide string literals (whose encoding has to be decided at
    compile time before any locale is selected) can contain unicode
    escapes.


    Lauri
    Lauri Alanko, Nov 21, 2011
    #7
  8. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-21, Lauri Alanko <> wrote:
    > In article <>,
    > Kaz Kylheku <> wrote:
    >> On 2011-11-21, Ben Pfaff <> wrote:
    >> http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
    >>
    >> It probably pretty much says the same thing, because quite likely that text is
    >> the source for Lauri's opinion, or both have some other common source.

    >
    > That is indeed where I learned about the locale-dependency of wchar_t.
    > I found it hard to believe myself, so I checked.
    >
    > http://svnweb.freebsd.org/base/head/lib/libc/locale/euc.c?revision=172619&view=markup
    >
    > Here we have the following:
    >
    > 199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
    > 200                 if (*s == '\0') {
    > 201                         errno = EILSEQ;
    > 202                         return ((size_t)-1);
    > 203                 }
    > 204                 wc = (wc << 8) | (unsigned char)*s++;
    > 205         }


    So it's obvious here that a wchar_t does not have an encoding. Some other
    encoding is being decoded, and that becomes the value of wchar_t.

    > That is, in the EUC locale, the wchar_t value of a character consists
    > of just the bits of the variable-width encoding of that character in
    > EUC. From quick perusing of the source, other variable-width encodings
    > seem to work the same way, except for utf8.c, which decodes the code
    > point and stores that in wchar_t.


    But is that wrong?

    Decoding the utf8 code point is certainly right.

    Based on anything you know about EUC (I know nothing), is EUC being handled
    properly above? (Furthermore, do you care about the EUC encoding?)

    This code is inside the mbrtowc function. Of course mbrtowc is
    locale-dependent, by design. It converts multibyte strings to wchar_t, and it
    has to do so according to an encoding! This function is locale-dependent,
    not the wchar_t type. (And you don't have to use this function.)

    Definitely, it's a good idea to do your own encoding and decoding, for
    portability, at least in some kinds of programs.

    The ISO C standard gives us this:

    "At program startup, the equivalent of:

    setlocale(LC_ALL, "C");

    is executed."

    So in C, you are automatically in the safe "C" locale, which specifies the
    "minimal environment for C translation". You're insulated from the effect of
    the native environment locale until you explicitly call setlocale(LC_ALL, "").

    If you don't want to do localization using the C library, just don't
    call setlocale, and do all your own converting from external formats. You can
    still use wchar_t. Just don't use wide streams, don't use mbstowcs, etc.
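
    For example, here is a bare-bones locale-independent UTF-8 decoder in
    that spirit (a sketch only: it does not reject overlong forms or
    surrogate code points):

        #include <stddef.h>

        /* Decode one UTF-8 sequence from s (n bytes available) into a
           code point, without consulting the locale.  Returns the number
           of bytes consumed, or 0 on malformed or truncated input. */
        static size_t utf8_decode(const unsigned char *s, size_t n,
                                  unsigned long *cp)
        {
            if (n == 0)
                return 0;
            if (s[0] < 0x80) {                        /* 1 byte: ASCII */
                *cp = s[0];
                return 1;
            }
            if ((s[0] & 0xE0) == 0xC0 && n >= 2
                && (s[1] & 0xC0) == 0x80) {           /* 2-byte form */
                *cp = ((unsigned long) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
                return 2;
            }
            if ((s[0] & 0xF0) == 0xE0 && n >= 3
                && (s[1] & 0xC0) == 0x80
                && (s[2] & 0xC0) == 0x80) {           /* 3-byte form */
                *cp = ((unsigned long) (s[0] & 0x0F) << 12)
                    | ((unsigned long) (s[1] & 0x3F) << 6)
                    | (s[2] & 0x3F);
                return 3;
            }
            if ((s[0] & 0xF8) == 0xF0 && n >= 4
                && (s[1] & 0xC0) == 0x80
                && (s[2] & 0xC0) == 0x80
                && (s[3] & 0xC0) == 0x80) {           /* 4-byte form */
                *cp = ((unsigned long) (s[0] & 0x07) << 18)
                    | ((unsigned long) (s[1] & 0x3F) << 12)
                    | ((unsigned long) (s[2] & 0x3F) << 6)
                    | (s[3] & 0x3F);
                return 4;
            }
            return 0;
        }

    The resulting code point can go straight into a wchar_t (where wchar_t
    is wide enough) without the locale machinery ever being involved.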

    By the way, feel free to take any code (BSD licensed) from here:

    http://www.kylheku.com/cgit/txr/tree/

    I've handled the internationalization of the program by restricting
    all I/O to utf-8 and using wchar_t to store characters. On Cygwin and Win32,
    text is restricted to U+0000 through U+FFFF. Users who find that
    lacking can use a better OS. Problem solved.

    > Frankly, I cannot understand how platforms like these could support
    > C1X where wide string literals (whose encoding has to be decided at
    > compile time before any locale is selected) can contain unicode
    > escapes.


    Simply by treating all conversions to wchar_t as targeting a common
    representation (Unicode).

    So for instance, suppose you have the character 野 in a literal, perhaps as a
    UTF-8 multibyte character, or a \u sequence in ASCII. This maps to a wchar_t
    which has the Unicode value.

    The user is in a Shift-JIS locale, and inputs a string which contains
    野 in Shift-JIS encoding. You convert that to wchar_t using the correct
    locale and, blam: same character code as what came from your string
    literal.
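
    A hypothetical demonstration (the locale name is platform-dependent,
    and the two values only compare equal where both paths target
    Unicode, which is exactly the property under discussion):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            /* From a literal: U+91CE (野) via a C99/C1X \u escape. */
            wchar_t from_literal = L'\u91ce';

            /* From input: 0x96 0xEC is 野 in Shift-JIS.  Error checks
               (setlocale/mbtowc failure) omitted for brevity. */
            wchar_t from_input = 0;
            setlocale(LC_CTYPE, "ja_JP.SJIS");
            mbtowc(&from_input, "\x96\xec", 2);

            puts(from_literal == from_input ? "same" : "different");
            return 0;
        }
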
    Kaz Kylheku, Nov 21, 2011
    #8
  9. Lauri Alanko

    Lauri Alanko Guest

    In article <>,
    Kaz Kylheku <> wrote:
    > On 2011-11-21, Lauri Alanko <> wrote:
    > > 199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
    > > 200                 if (*s == '\0') {
    > > 201                         errno = EILSEQ;
    > > 202                         return ((size_t)-1);
    > > 203                 }
    > > 204                 wc = (wc << 8) | (unsigned char)*s++;
    > > 205         }

    >
    > So it's obvious here that a wchar_t does not have an encoding. Some other
    > encoding is being decoded, and that becomes the value of wchar_t.


    That is a very strange way of putting it. Certainly wchar_t has _an_
    encoding, that is, a mapping between abstract characters and integer
    values. (In Unicode terminology, it's a "coded character set".)

    The euc.c module is a bit of a complex example, since it is
    parameterized (as there are many variants of EUC):

    http://www.gsp.com/cgi-bin/man.cgi?section=5&topic=euc

    Even the man page explicitly says that the encoding of wchar_t
    depends on the precise definition of the locale. For instance, the
    character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
    is represented by the wchar_t value 0xb0a6.

    > > That is, in the EUC locale, the wchar_t value of a character consists
    > > of just the bits of the variable-width encoding of that character in
    > > EUC. From quick perusing of the source, other variable-width encodings
    > > seem to work the same way, except for utf8.c, which decodes the code
    > > point and stores that in wchar_t.

    >
    > But is that wrong?


    No _single_ encoding is wrong; the problem is that these different
    locales have different encodings for wchar_t. In the utf-8 locale, the
    character for love is represented by the wchar_t value 0x611b. So now,
    if I want my library to input and output wchar_t values, _I need to
    know which locale they were produced on_ in order to know how to
    interpret them.
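
    Here is the problem in miniature (a hypothetical demonstration: the
    locale names vary by platform, and the first result assumes a
    FreeBSD-style libc):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            wchar_t wc;

            setlocale(LC_CTYPE, "ja_JP.eucJP");
            mbtowc(&wc, "\xb0\xa6", 2);         /* EUC-JP bytes for U+611B */
            printf("eucJP: %#lx\n", (unsigned long) wc);   /* 0xb0a6 */

            mbtowc(NULL, NULL, 0);              /* reset conversion state */
            setlocale(LC_CTYPE, "ja_JP.UTF-8");
            mbtowc(&wc, "\xe6\x84\x9b", 3);     /* UTF-8 bytes for U+611B */
            printf("UTF-8: %#lx\n", (unsigned long) wc);   /* 0x611b */
            return 0;
        }

    Same abstract character, two different wchar_t values, depending only
    on the locale that happened to be current.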

    > This code is inside the mbrtowc function. Of course mbrtowc is
    > locale-dependent, by design. It converts multibyte strings to wchar_t, and it
    > has to do so according to an encoding! This function is locale-dependent,
    > not the wchar_t type.


    The standard library functions, and wide string literals, are what
    imbue wchar_t values with an intended interpretation as characters.
    Without that intended interpretation, wchar_t would just be a plain
    integer type that wouldn't fulfill any function that other integer
    types don't already.

    > Definitely, it's a good idea to do your own encoding and decoding, for
    > portability, at least in some kinds of programs.


    I'm not concerned with external encodings (other than UTF-8, which is
    used by a certain file format I process). I can let the user of my
    library worry about those. I'm concerned with the API, and the choice
    of representation for strings. It's not only a question of choosing a
    type; there must also be an interpretation for values of that type.
    And for wchar_t, it seems, the interpretation can be quite volatile.

    > If you don't want to do localization using the C library, just don't
    > call setlocale, and do all your own converting from external formats.


    I'm writing a _library_. As I explained earlier, a library cannot
    control, or constrain, the current locale. Perhaps someone would like
    to plug the library into a legacy application that needs to be run
    in a certain locale. As a library writer, it's my job to make sure
    that this is possible without pain.

    > You can still use wchar_t. Just don't use wide streams, don't use
    > mbstowcs, etc.


    I indeed do not need to use those, but the user of the library
    presumably might. Now suppose someone calls a function in my library,
    and I wish to return the character for love as a wchar_t. Now how can
    I know which wchar_t value I should return?

    > I've handled the internationalization of the program by restricting
    > all I/O to utf-8 and using wchar_t to store characters. On Cygwin and Win32,
    > text is restricted to U+0000 through U+FFFF. Users who find that
    > lacking can use a better OS. Problem solved.


    It's curious that you find this particular limitation of Windows to be
    significant. It's a nuisance, sure, but I don't see why it would be so
    important to have a single wchar_t value represent a whole code point.
    The only important operations on individual wchar_t's are those in
    <wctype.h>, but if you need to classify code points at all, you are soon
    likely to need more detailed access to Unicode character properties
    that goes beyond what <wctype.h> provides.

    And if you need to split a piece of text into discrete units, I don't
    see why code points, especially of unnormalized or NFC-normalized
    text, would be any more important units than, say, grapheme clusters.

    > > Frankly, I cannot understand how platforms like these could support
    > > C1X where wide string literals (whose encoding has to be decided at
    > > compile time before any locale is selected) can contain unicode
    > > escapes.

    >
    > Simply by treating all conversions to wchar_t as targeting a common
    > representation (Unicode).


    You mean, rewriting all those locale modules so that wchar_t always
    has a consistent value (the unicode code point) for a given character,
    regardless of the way it is encoded in the current module?

    That's effectively what I was saying: those platforms, as they
    currently stand, cannot have locale-independent unicode literals, so
    they have to be modified.

    But actually, I'm not quite sure if C1X really requires unicode
    literals to be locale-independent. The text on character constants,
    string literals and universal character names is really confusing, and
    there's talk about "an implementation-dependent current locale", so it
    might be that even C1X allows the meaning of wide string literals to
    vary between locales. It'd be a shame if this is true.


    Lauri
    Lauri Alanko, Nov 22, 2011
    #9
  10. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-22, Lauri Alanko <> wrote:
    > For instance,
    > the character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
    > is represented by the wchar_t value 0xb0a6.


    Ah, if that's the case, that is pretty broken. \xb0\xa6 should decode into the
    value 0x611B. They ducked out of doing it right, didn't they. Perhaps this
    preserves the behavior of legacy programs which expect the mapping
    to work that way.

    >> Simply by treating all conversions to wchar_t as targeting a common
    >> representation (Unicode).

    >
    > You mean, rewriting all those locale modules so that wchar_t always
    > has a consistent value (the unicode code point) for a given character,


    Or not using/supporting the locales that don't produce Unicode code points.
    You can treat those as weird legacy cruft, like EBCDIC. Find out what works,
    and document that as being supported. "This is a Unicode program, whose
    embedded strings are in Unicode, and which requires a Unicode-compatible
    locale."

    Either way, you don't have to throw out the wchar_t. It is handy because it's
    supported in the form of string literals, and some useful functions like
    wcsspn, wcschr, etc.
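
    For instance, the familiar idioms carry over directly (a trivial
    sketch):

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            const wchar_t *s = L"hello, world";
            const wchar_t *comma = wcschr(s, L',');  /* first L',' -> index 5 */
            size_t span = wcsspn(s, L"ehlo");        /* leading run: 5 */
            printf("%d %u\n", (int) (comma - s), (unsigned) span);
            return 0;
        }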

    I think you have to regard these two problems as being completely separate:

    - writing software that is multilingual.
    - targeting two or more incompatible ways of being multilingual,
      simultaneously in the same program (incompatible meaning that the
      internal representation for characters follows a different map).

    I think you're taking too much into your scope: you want to solve both
    problems, and so then when you look at this FreeBSD mess, it looks
    intractable.

    Solve the first problem, and forget the second.

    The only people needing to solve the second problem are those who
    are saddled with legacy support requirements, like having to continue
    being able to read data from 20 year old versions of the software.
    Kaz Kylheku, Nov 22, 2011
    #10
  11. Lauri Alanko

    Lauri Alanko Guest

    In article <>,
    Kaz Kylheku <> wrote:
    > On 2011-11-22, Lauri Alanko <> wrote:
    > > For instance,
    > > the character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
    > > is represented by the wchar_t value 0xb0a6.

    >
    > Ah, if that's the case, that is pretty broken. \xb0\xa6 should decode into the
    > value 0x611B.


    It should, if __STDC_ISO_10646__ were defined. The standard doesn't
    require it to be.

    > They ducked out of doing it right, didn't they.


    BSD predates Unicode. I can quite well understand that old locale code
    doesn't map characters into their unicode values. What I didn't expect
    is that there would be no attempt to keep the non-standard mappings
    consistent between locales.

    > Or not using/supporting the locales that don't produce Unicode code points.
    > You can treat those as weird legacy cruft, like EBCDIC.


    I have worked on an EBCDIC platform. They are real.

    > Find out what works,
    > and document that as being supported. "This is a Unicode program, whose
    > embedded strings are in Unicode, and which requires a Unicode-compatible
    > locale."


    As I was saying, wchar_t is useless in portable C programming. You
    seem to be concurring, although in a roundabout way.

    > Either way, you don't have to throw out the wchar_t. It is handy because it's
    > supported in the form of string literals, and some useful functions like
    > wcsspn, wcschr, etc.


    None of those are particularly useful for me. I wouldn't be using
    wchar_t* for data storage or processing anyway, just for interchange.

    And again, I don't see a single code point as being a very meaningful
    unit of text. If you need to search for a piece of text, you most
    likely need to search for a substring.

    The only use wchar_t could have had for me was if it had been an
    established, well-defined way of representing multilingual text. As it
    stands, it is far too underdefined, so it has no real use for me.

    > I think you have to regard these two problems as being completely separate:
    >
    > - writing software that is multilingual.
    > - targeting two or more incompatible ways of being multilingual,
    >   simultaneously in the same program (incompatible meaning that the
    >   internal representation for characters follows a different map).
    >
    > I think you're taking too much into your scope: you want to solve both
    > problems, and so then when you look at this FreeBSD mess, it looks
    > intractable.
    >
    > Solve the first problem, and forget the second.


    That's what I'm doing, and that's why I'm going to forget about wchar_t.


    Lauri
    Lauri Alanko, Nov 22, 2011
    #11
  12. Dann Corbit

    Dann Corbit Guest

    In article <>, says...
    >
    > On 2011-11-21, Lauri Alanko <> wrote:

    [snip]
    >
    > > So screw it all, I'll just use UTF-32 like I should have from the
    > > beginning.



    Or you could use this, which is what every sensible person does:
    http://www-01.ibm.com/software/globalization/icu/
    Dann Corbit, Nov 22, 2011
    #12
  13. Joe keane

    Joe keane Guest

    No one should work on I18N unless they're Finnish, Hungarian, or Japanese.
    Joe keane, Nov 24, 2011
    #13
  14. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-24, Joe keane <> wrote:
    > No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


    It's no longer realistic to write programs that do any text processing
    but handle only 8-bit text, even if those programs are not actually
    multilingual.

    (I mean programs for the world to use, not just for use by the author and maybe
    a few of his ISO-latin-character-using colleagues.)
    Kaz Kylheku, Nov 24, 2011
    #14
  15. Keith Thompson

    Keith Thompson Guest

    (Joe keane) writes:
    > No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


    Do you mean that speakers of those languages have additional insight
    into internationalization issues? If so, you have a point, but
    it's vastly overstated (deliberately so for effect, I presume).

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Nov 24, 2011
    #15
  16. James Kuyper

    James Kuyper Guest

    On 11/24/2011 04:16 PM, Kaz Kylheku wrote:
    > On 2011-11-24, Joe keane <> wrote:
    >> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.

    >
    > It's no longer realistic to write programs that do any text processing
    > but handle only 8-bit text, even if those programs are not actually multilingual.
    >
    > (I mean programs for the world to use, not just for use by the author and maybe
    > a few of his ISO-latin-character-using colleagues.)


    Not even if those "few colleagues" number in the hundreds of millions?
    I'm in favor of I18N, but I also work in one of the largest markets in
    the world where it's quite feasible to make a decent profit on software
    that has support for only one language.
    --
    James Kuyper
    James Kuyper, Nov 24, 2011
    #16
  17. 88888 Dihedral

    88888 Dihedral Guest

    On Monday, November 21, 2011 10:14:35 PM UTC+8, Lauri Alanko wrote:
    > I have recently written a number of posts regarding C's wide character
    > support. It now turns out that my investigation has been in vain:
    > wchar_t is useless in portable C programming, although I'm not quite
    > sure whether the standard or implementations are to blame for this. Most
    > likely both: the standard has sanctioned the implementations'
    > deficiencies.
    <snip>


    C strings are too slow for many applications nowadays, because the
    default standard representation is terminator-marked rather than
    length-tagged. That was good for teaching pointers, and it opened the
    door to writing assembly and C together on many platforms to support
    a C compiler a long, long time ago!
    88888 Dihedral, Nov 25, 2011
    #17
  18. Rui Maciel

    Rui Maciel Guest

    Joe keane wrote:

    > No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


    Why is that?


    Rui Maciel
    Rui Maciel, Nov 25, 2011
    #18
  19. Jorgen Grahn

    Jorgen Grahn Guest

    On Thu, 2011-11-24, Keith Thompson wrote:
    > (Joe keane) writes:
    >> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.

    >
    > Do you mean that speakers of those languages have additional insight
    > into internationalization issues? If so, you have a point, but
    > it's vastly overstated (deliberately so for effect, I presume).


    AFAICT, the only thing special about Finland compared to the rest of
    the Latin-1 world is that Finnish has very long words.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Nov 27, 2011
    #19