wchar_t is useless

Discussion in 'C Programming' started by Lauri Alanko, Nov 21, 2011.

  1. Lauri Alanko

    Lauri Alanko Guest

    I have recently written a number of posts regarding C's wide character
    support. It now turns out that my investigation has been in vain:
    wchar_t is useless in portable C programming, although I'm not quite
    sure whether the standard or implementations are to blame for this. Most
    likely both: the standard has sanctioned the implementations'
    deficiencies.

    I'm working on a library that deals with multilingual strings. The
    library only does computation and doesn't need very fancy I/O, so
    I'm trying to avoid unnecessary platform dependencies and make the
    library as portable as possible.

    One question I'm facing is what kind of representation to use for the
    multilingual strings in the public API of the library. Internally, the
    library reads some binary data containing UTF-8 strings, so the obvious
    answer would be for the public library functions to accept and return
    strings in a standard unicode format, either UTF-8 or UTF-32.

    But this is not very C-ish. Since C has standard ways to represent
    multilingual strings, it's more convenient for the API to use those
    standard ways rather than introducing yet another string representation
    type. I thought.

    So I considered the options. Multibyte strings are not a viable choice,
    since their encoding is locale-dependent. If the library communicated
    via multibyte strings, then the locale would have to be set to something
    that made it possible to represent all the strings that the library had
    to deal with.

    But a library cannot make requirements on the global locale: libraries
    should be components that can be plugged together, and if they begin to
    make any requirements on the locale, then they cannot be used together
    if the requirements conflict.

    I cannot understand why C still has only a global locale. C++ came up
    with first-class locales ages ago, and surely by now everyone should
    know that anything global wreaks havoc on interoperability and
    re-entrancy.

    So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
    represents a unicode code point, this would be just perfect. But that's
    not the case on all platforms. But that's okay, I thought, as long as I
    can (with some platform-dependent magic) convert between unicode code
    points and wchar_t.

    On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
    code point can require two wchar_t's. That's ugly (and makes <wctype.h>
    useless), but not very crucial for my purposes. The important thing is
    that sequences of code points can still be encoded to and from wide
    _strings_. I could have lived with this.
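
    (For what it's worth, the conversion between a code point and a
    UTF-16 pair is at least mechanical. A minimal sketch, using unsigned
    short to stand in for Windows' 16-bit wchar_t:)

        #include <stdio.h>

        /* Encode one Unicode code point as UTF-16 code units, the
           representation Windows uses for wchar_t strings.  Returns the
           number of units written (1 or 2). */
        static int to_utf16(unsigned long cp, unsigned short out[2])
        {
            if (cp < 0x10000) {
                out[0] = (unsigned short) cp;
                return 1;
            }
            cp -= 0x10000;
            out[0] = (unsigned short) (0xD800 | (cp >> 10));   /* high surrogate */
            out[1] = (unsigned short) (0xDC00 | (cp & 0x3FF)); /* low surrogate */
            return 2;
        }

        int main(void)
        {
            unsigned short u[2];
            int n = to_utf16(0x1D11E, u); /* U+1D11E MUSICAL SYMBOL G CLEF */
            printf("%d units: %04X %04X\n", n,
                   (unsigned) u[0], (unsigned) (n == 2 ? u[1] : 0));
            return 0;
        }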

    But then I found out about the killer: on FreeBSD (and Solaris?) the
    encoding used by wchar_t is locale-dependent! That is, a single wchar_t
    can represent any code point supported by the current locale, but the
    same wchar_t value may be used to represent different code points in
    different locales. So adopting wchar_t as the representation type would
    again make the capabilities of the library dependent on the current
    locale, which might be constrained by other parts of the application.
    (Also, the locale-dependent wchar_t encodings are quite undocumented, so
    the required platform-dependent magic would be magic indeed.)

    To recap: C's multibyte strings are in a locale-dependent, possibly
    variable-width encoding. On Windows, the wchar_t string encoding is
    variable-width, on FreeBSD and Solaris it is locale-dependent. So for
    portable C code, wchar_t doesn't provide any advantages over multibyte
    strings.

    So screw it all, I'll just use UTF-32 like I should have from the
    beginning.
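
    (Concretely, the public API would then look something like the
    following sketch; the names here are hypothetical:)

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One Unicode code point per element; strings are 0-terminated
           arrays of these, by analogy with char strings. */
        typedef uint32_t ml_char;

        /* Length in code points (hypothetical API function). */
        static size_t ml_strlen(const ml_char *s)
        {
            size_t n = 0;
            while (s[n] != 0)
                n++;
            return n;
        }

        int main(void)
        {
            ml_char love[] = { 0x611B, 0 }; /* U+611B, as an example */
            printf("%u\n", (unsigned) ml_strlen(love));
            return 0;
        }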


    Lauri
    Lauri Alanko, Nov 21, 2011
    #1

  2. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-21, Lauri Alanko <> wrote:
    > I have recently written a number of posts regarding C's wide character
    > support. It now turns out that my investigation has been in vain:
    > wchar_t is useless in portable C programming, although I'm not quite


    That is false; it is useful.

    > So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
    > represents a unicode code point, this would be just perfect. But that's
    > not the case on all platforms. But that's okay, I thought, as long as I
    > can (with some platform-dependent magic) convert between unicode code
    > points and wchar_t.


    wchar_t is an integral type which represents an integral value. It does not
    represent a code point any more than "char" represents an ASCII value.

    > On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
    > code point can require two wchar_t's. That's ugly (and makes <wctype.h>


    This is a limitation of Windows. The Windows API uses 16-bit wide
    characters, so you can't get away from this no matter what language you
    write in on Windows.

    Redmond has decided that characters outside of the Unicode BMP (basic
    multilingual plane) are unimportant for its user base. So, if your
    program has customers who are Windows users, you can safely assume that
    they have already swallowed this pill.

    You get a lot of internationalization mileage out of the BMP. Actually all the
    mileage. Above U+FFFF there is only academic crap. Anyone who cares about those
    characters is likely also going to be some kind of "freetard" who won't pay a
    dime for software.

    > But then I found out about the killer: on FreeBSD (and Solaris?) the
    > encoding used by wchar_t is locale-dependent!


    I would expect this "locale dependent" to mean that if, say, a Japanese
    user is working with Shift-JIS files, then he or she can set that up in
    the locale such that when these files are processed by your program, the
    characters being read and written map to sane values of wchar_t (where
    sane == based on Unicode!).

    wchar_t does not have an encoding; it's an integral type. The encoding
    of wchar_t is binary enumeration: 000...0101 encodes 5, etc.

    Do you have some quotes from FreeBSD or Solaris documentation on this matter
    that are giving you concern? Post them.

    > So screw it all, I'll just use UTF-32 like I should have from the
    > beginning.


    But that just means you have to write your own library instead of just
    using C95 functions like wcsspn, wcscpy, etc. What if you want to do
    printf-like formatting to a wide string? Can't use swprintf.

    Here is a better idea: just use wchar_t, forget about U+1XXXX on Windows
    because Microsoft has decided that one for your users already, and if
    locale-dependent streams give you an allergic reaction, handle your own
    decoding/encoding for doing I/O.
    Kaz Kylheku, Nov 21, 2011
    #2

  3. James Kuyper

    James Kuyper Guest

    On 11/21/2011 11:17 AM, Kaz Kylheku wrote:
    ....
    > You get a lot of internationalization mileage out of the BMP. Actually all the
    > mileage. Above U+FFFF there is only academic crap.


    Academics have a need for software support too. One of my old friends
    has a husband who works mainly with dead languages; when I met him in
    1990 he could read and write 14 of them; he's probably added more since
    then. I suspect he would find software that supported Plane 1 useful.

    The sources I checked didn't give any indication of how adequate the
    BMP characters are for representing Chinese text. If the Unified Han
    Ideographs in Plane 2 are in fact needed for some purpose, there's a
    very large number of Chinese who would need them. That's hardly an
    academic issue.

    > ... Anyone who cares about those
    > characters is likely also going to be some kind of "freetard" who won't pay a
    > dime for software.


    As a professional software developer myself, I fully agree with the idea
    of paying people for their work. However, why should anyone buy software
    when they have free software available that is of acceptable quality,
    containing all the features they require?

    The only reason I can make money developing software is that there's no
    one willing to give away software that does what mine does, and that's
    just the way it should be.
    James Kuyper, Nov 21, 2011
    #3
  4. Ben Pfaff

    Ben Pfaff Guest

    Lauri Alanko <> writes:

    > To recap: C's multibyte strings are in a locale-dependent, possibly
    > variable-width encoding. On Windows, the wchar_t string encoding is
    > variable-width, on FreeBSD and Solaris it is locale-dependent. So for
    > portable C code, wchar_t doesn't provide any advantages over multibyte
    > strings.


    I agree with you.

    The libunistring manual has a section that says pretty much what
    you did in your message, by the way:
    http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
    --
    char a[]="\n .CJacehknorstu";int putchar(int);int main(void){unsigned long b[]
    ={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x11f6},*p
    =b,i=24;for(;p+=!*p;*p/=4)switch(0[p]&3)case 0:{return 0;for(p--;i--;i--)case+
    2:{i++;if(i)break;else default:continue;if(0)case 1:putchar(a[i&15]);break;}}}
    Ben Pfaff, Nov 21, 2011
    #4
  5. Jack McCue

    Jack McCue Guest

    Ben Pfaff <> wrote:
    > Lauri Alanko <> writes:
    >
    >> To recap: C's multibyte strings are in a locale-dependent, possibly

    <snip>
    >
    > I agree with you.

    ditto
    >
    > The libunistring manual has a section that says pretty much what
    > you did in your message, by the way:
    > http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess


    Thanks for the URL. I struggled with wchar_t
    on AIX for a bit and then ended up writing a set of
    small functions; my needs were simple at the time.
    At least now I know why I had a hard time; I thought I was
    missing something. Maybe I still am :)

    Regards,
    Jack
    Jack McCue, Nov 21, 2011
    #5
  6. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-21, Ben Pfaff <> wrote:
    > Lauri Alanko <> writes:
    >
    >> To recap: C's multibyte strings are in a locale-dependent, possibly
    >> variable-width encoding. On Windows, the wchar_t string encoding is
    >> variable-width, on FreeBSD and Solaris it is locale-dependent. So for
    >> portable C code, wchar_t doesn't provide any advantages over multibyte
    >> strings.

    >
    > I agree with you.
    >
    > The libunistring manual has a section that says pretty much what
    > you did in your message, by the way:
    > http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess


    It probably pretty much says the same thing, because quite likely that text is
    the source for Lauri's opinion, or both have some other common source. For
    instance, look at this:

    ``On Solaris and FreeBSD, the wchar_t encoding is locale dependent and undocumented.''

    Eerie similarity!

    I don't agree with this libunistring manual. The wchar_t type is useful and
    just fine.

    They are right about the limitation of Windows, but nobody ever went
    wrong in accepting the limitations of Microsoft Windows in order to
    write software for users of Windows who have also accepted those
    limitations.

    If you want to do processing with rare languages on Windows, install a virtual
    machine running GNU/Linux and you have 32 bit wchar_t. GNU/Linux is more
    likely than Redmond to have fonts to display your rare languages, too.

    Clearly the libunistring authors don't understand what Solaris and
    FreeBSD mean by "encoding" (and they do not care whether they are right
    or wrong because, after all, they have a library which will fix the
    FreeBSD or Solaris problem, regardless of whether it is real or
    imagined). Hey, a user who needlessly uses your library is still a user!

    And undocumented, by the way? Uh, use the source, Luke?

    Oh, and The Single Unix Specification, Issue 6, says this about wchar_t:

    wchar_t

    Integer type whose range of values can represent distinct wide-character
    codes for all members of the largest character set specified among the
    locales supported by the compilation environment: the null character has
    the code value 0 and each member of the portable character set has a code
    value equal to its value when used as the lone character in an integer
    character constant.

    I very much doubt that FreeBSD and Solaris go against the grain on this one
    in any way.
    Kaz Kylheku, Nov 21, 2011
    #6
  7. Lauri Alanko

    Lauri Alanko Guest

    In article <>,
    Kaz Kylheku <> wrote:
    > On 2011-11-21, Ben Pfaff <> wrote:
    > http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
    >
    > It probably pretty much says the same thing, because quite likely that text is
    > the source for Lauri's opinion, or both have some other common source.


    That is indeed where I learned about the locale-dependency of wchar_t.
    I found it hard to believe myself, so I checked.

    http://svnweb.freebsd.org/base/head/lib/libc/locale/euc.c?revision=172619&view=markup

    Here we have the following:

        199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
        200                 if (*s == '\0') {
        201                         errno = EILSEQ;
        202                         return ((size_t)-1);
        203                 }
        204                 wc = (wc << 8) | (unsigned char)*s++;
        205         }

    That is, in the EUC locale, the wchar_t value of a character consists
    of just the bits of the variable-width encoding of that character in
    EUC. From quick perusing of the source, other variable-width encodings
    seem to work the same way, except for utf8.c, which decodes the code
    point and stores that in wchar_t.

    As for solaris, I tried it out:


    $ uname -a
    SunOS kruuna.helsinki.fi 5.10 Generic_127111-05 sun4u sparc
    $ cat wc.c
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>

    int main(int argc, char* argv[]) {
        setlocale(LC_CTYPE, argv[1]);
        wchar_t wc = fgetwc(stdin);
        printf("%08lx\n", (unsigned long) wc);
        return 0;
    }
    $ echo -e '\xa4' | ./wc fi_FI.ISO8859-1 # U+00A4 CURRENCY SIGN
    30000024
    $ echo -e '\xa4' | ./wc fi_FI.ISO8859-15 # U+20AC EURO SIGN
    30000024
    $ echo -e '\xa4' | iconv -f iso-8859-1 -t utf-8 | ./wc fi_FI.UTF-8
    000000a4
    $ echo -e '\xa4' | iconv -f iso-8859-15 -t utf-8 | ./wc fi_FI.UTF-8
    000020ac


    Frankly, I cannot understand how platforms like these could support
    C1X where wide string literals (whose encoding has to be decided at
    compile time before any locale is selected) can contain unicode
    escapes.


    Lauri
    Lauri Alanko, Nov 21, 2011
    #7
  8. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-21, Lauri Alanko <> wrote:
    > In article <>,
    > Kaz Kylheku <> wrote:
    >> On 2011-11-21, Ben Pfaff <> wrote:
    >> http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
    >>
    >> It probably pretty much says the same thing, because quite likely that text is
    >> the source for Lauri's opinion, or both have some other common source.

    >
    > That is indeed where I learned about the locale-dependency of wchar_t.
    > I found it hard to believe myself, so I checked.
    >
    > http://svnweb.freebsd.org/base/head/lib/libc/locale/euc.c?revision=172619&view=markup
    >
    > Here we have the following:
    >
    > 199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
    > 200                 if (*s == '\0') {
    > 201                         errno = EILSEQ;
    > 202                         return ((size_t)-1);
    > 203                 }
    > 204                 wc = (wc << 8) | (unsigned char)*s++;
    > 205         }


    So it's obvious here that a wchar_t does not have an encoding. Some other
    encoding is being decoded, and that becomes the value of wchar_t.

    > That is, in the EUC locale, the wchar_t value of a character consists
    > of just the bits of the variable-width encoding of that character in
    > EUC. From quick perusing of the source, other variable-width encodings
    > seem to work the same way, except for utf8.c, which decodes the code
    > point and stores that in wchar_t.


    But is that wrong?

    Decoding the utf8 code point is certainly right.

    Based on anything you know about EUC (I know nothing), is EUC being handled
    properly above? (Furthermore, do you care about the EUC encoding?)

    This code is inside the mbrtowc function. Of course mbrtowc is
    locale-dependent, by design. It converts multibyte strings to wchar_t, and it
    has to do so according to an encoding! This function is locale-dependent,
    not the wchar_t type. (And you don't have to use this function.)

    Definitely, it's a good idea to do your own encoding and decoding, for
    portability, at least in some kinds of programs.

    The ISO C standard gives us this:

    "At program startup, the equivalent of:

    setlocale(LC_ALL, "C");

    is executed."

    So in C, you are automatically in the safe "C" locale, which specifies the
    "minimal environment for C translation". You're insulated from the effect of
    the native environment locale until you explicitly call setlocale(LC_ALL, "").

    If you don't want to do localization using the C library, just don't
    call setlocale, and do all your own converting from external formats. You can
    still use wchar_t. Just don't use wide streams, don't use mbstowcs, etc.
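
    For example, here is a bare-bones locale-independent UTF-8 decoder in
    that spirit (a sketch only: it does not reject overlong forms or
    surrogate code points):

        #include <stddef.h>

        /* Decode one UTF-8 sequence from s (n bytes available) into a
           code point, without consulting the locale.  Returns the number
           of bytes consumed, or 0 on malformed or truncated input. */
        static size_t utf8_decode(const unsigned char *s, size_t n,
                                  unsigned long *cp)
        {
            if (n == 0)
                return 0;
            if (s[0] < 0x80) {                        /* 1 byte: ASCII */
                *cp = s[0];
                return 1;
            }
            if ((s[0] & 0xE0) == 0xC0 && n >= 2
                && (s[1] & 0xC0) == 0x80) {           /* 2-byte form */
                *cp = ((unsigned long) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
                return 2;
            }
            if ((s[0] & 0xF0) == 0xE0 && n >= 3
                && (s[1] & 0xC0) == 0x80
                && (s[2] & 0xC0) == 0x80) {           /* 3-byte form */
                *cp = ((unsigned long) (s[0] & 0x0F) << 12)
                    | ((unsigned long) (s[1] & 0x3F) << 6)
                    | (s[2] & 0x3F);
                return 3;
            }
            if ((s[0] & 0xF8) == 0xF0 && n >= 4
                && (s[1] & 0xC0) == 0x80
                && (s[2] & 0xC0) == 0x80
                && (s[3] & 0xC0) == 0x80) {           /* 4-byte form */
                *cp = ((unsigned long) (s[0] & 0x07) << 18)
                    | ((unsigned long) (s[1] & 0x3F) << 12)
                    | ((unsigned long) (s[2] & 0x3F) << 6)
                    | (s[3] & 0x3F);
                return 4;
            }
            return 0;
        }

    The resulting code point can go straight into a wchar_t (where wchar_t
    is wide enough) without the locale machinery ever being involved.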

    By the way, feel free to take any code (BSD licensed) from here:

    http://www.kylheku.com/cgit/txr/tree/

    I've handled the internationalization of the program by restricting
    all I/O to utf-8 and using wchar_t to store characters. On Cygwin and Win32,
    text is restricted to U+0000 through U+FFFF. Users who find that
    lacking can use a better OS. Problem solved.

    > Frankly, I cannot understand how platforms like these could support
    > C1X where wide string literals (whose encoding has to be decided at
    > compile time before any locale is selected) can contain unicode
    > escapes.


    Simply by treating all conversions to wchar_t as targeting a common
    representation (Unicode).

    So for instance, suppose you have the character 野 in a literal, perhaps as a
    UTF-8 multibyte character, or a \u sequence in ASCII. This maps to a wchar_t
    which has the Unicode value.

    The user is in a Shift-JIS locale, and inputs a string which contains
    野 in Shift-JIS encoding. You convert that to wchar_t using the correct
    locale and, blam: same character code as what came from your string
    literal.
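
    A hypothetical demonstration (the locale name is platform-dependent,
    and the two values only compare equal where both paths target
    Unicode, which is exactly the property under discussion):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            /* From a literal: U+91CE (野) via a C99/C1X \u escape. */
            wchar_t from_literal = L'\u91ce';

            /* From input: 0x96 0xEC is 野 in Shift-JIS.  Error checks
               (setlocale/mbtowc failure) omitted for brevity. */
            wchar_t from_input = 0;
            setlocale(LC_CTYPE, "ja_JP.SJIS");
            mbtowc(&from_input, "\x96\xec", 2);

            puts(from_literal == from_input ? "same" : "different");
            return 0;
        }
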
    Kaz Kylheku, Nov 21, 2011
    #8
  9. Lauri Alanko

    Lauri Alanko Guest

    In article <>,
    Kaz Kylheku <> wrote:
    > On 2011-11-21, Lauri Alanko <> wrote:
    > > 199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
    > > 200                 if (*s == '\0') {
    > > 201                         errno = EILSEQ;
    > > 202                         return ((size_t)-1);
    > > 203                 }
    > > 204                 wc = (wc << 8) | (unsigned char)*s++;
    > > 205         }

    >
    > So it's obvious here that a wchar_t does not have an encoding. Some other
    > encoding is being decoded, and that becomes the value of wchar_t.


    That is a very strange way of putting it. Certainly wchar_t has _an_
    encoding, that is, a mapping between abstract characters and integer
    values. (In Unicode terminology, it's a "coded character set".)

    The euc.c module is a bit of a complex example, since it is
    parameterized (as there are many variants of EUC):

    http://www.gsp.com/cgi-bin/man.cgi?section=5&topic=euc

    Even the man page explicitly says that the encoding of wchar_t
    depends on the precise definition of the locale. For instance, the
    character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
    is represented by the wchar_t value 0xb0a6.

    > > That is, in the EUC locale, the wchar_t value of a character consists
    > > of just the bits of the variable-width encoding of that character in
    > > EUC. From quick perusing of the source, other variable-width encodings
    > > seem to work the same way, except for utf8.c, which decodes the code
    > > point and stores that in wchar_t.

    >
    > But is that wrong?


    No _single_ encoding is wrong; the problem is that these different
    locales have different encodings for wchar_t. In the utf-8 locale, the
    character for love is represented by the wchar_t value 0x611b. So now,
    if I want my library to input and output wchar_t values, _I need to
    know which locale they were produced on_ in order to know how to
    interpret them.
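
    Here is the problem in miniature (a hypothetical demonstration: the
    locale names vary by platform, and the first result assumes a
    FreeBSD-style libc):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            wchar_t wc;

            setlocale(LC_CTYPE, "ja_JP.eucJP");
            mbtowc(&wc, "\xb0\xa6", 2);         /* EUC-JP bytes for U+611B */
            printf("eucJP: %#lx\n", (unsigned long) wc);   /* 0xb0a6 */

            mbtowc(NULL, NULL, 0);              /* reset conversion state */
            setlocale(LC_CTYPE, "ja_JP.UTF-8");
            mbtowc(&wc, "\xe6\x84\x9b", 3);     /* UTF-8 bytes for U+611B */
            printf("UTF-8: %#lx\n", (unsigned long) wc);   /* 0x611b */
            return 0;
        }

    Same abstract character, two different wchar_t values, depending only
    on the locale that happened to be current.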

    > This code is inside the mbrtowc function. Of course mbrtowc is
    > locale-dependent, by design. It converts multibyte strings to wchar_t, and it
    > has to do so according to an encoding! This function is locale-dependent,
    > not the wchar_t type.


    The standard library functions, and wide string literals, are what
    imbue wchar_t values with an intended interpretation as characters.
    Without that intended interpretation, wchar_t would just be a plain
    integer type that wouldn't fulfill any function that other integer
    types don't already.

    > Definitely, it's a good idea to do your own encoding and decoding, for
    > portability, at least in some kinds of programs.


    I'm not concerned with external encodings (other than UTF-8, which is
    used by a certain file format I process). I can let the user of my
    library worry about those. I'm concerned with the API, and the choice
    of representation for strings. It's not only a question of choosing a
    type; there must also be an interpretation for values of that type.
    And for wchar_t, it seems, the interpretation can be quite volatile.

    > If you don't want to do localization using the C library, just don't
    > call setlocale, and do all your own converting from external formats.


    I'm writing a _library_. As I explained earlier, a library cannot
    control, or constrain, the current locale. Perhaps someone would like
    to plug the library into a legacy application that needs to be run
    in a certain locale. As a library writer, it's my job to make sure
    that this is possible without pain.

    > You can still use wchar_t. Just don't use wide streams, don't use
    > mbstowcs, etc.


    I indeed do not need to use those, but the user of the library
    presumably might. Now suppose someone calls a function in my library,
    and I wish to return the character for love as a wchar_t. Now how can
    I know which wchar_t value I should return?

    > I've handled the internationalization of the program by restricting
    > all I/O to utf-8 and using wchar_t to store characters. On Cygwin and Win32,
    > text is restricted to U+0000 through U+FFFF. Users who find that
    > lacking can use a better OS. Problem solved.


    It's curious that you find this particular limitation of Windows to be
    significant. It's a nuisance, sure, but I don't see why it would be so
    important to have a single wchar_t value represent a whole code point.
    The only important operations on individual wchar_t's are those in
    <wctype.h>, but if you need to classify code points at all, you are soon
    likely to need more detailed access to Unicode character properties
    that goes beyond what <wctype.h> provides.

    And if you need to split a piece of text into discrete units, I don't
    see why code points, especially of unnormalized or NFC-normalized
    text, would be any more important units than, say, grapheme clusters.

    > > Frankly, I cannot understand how platforms like these could support
    > > C1X where wide string literals (whose encoding has to be decided at
    > > compile time before any locale is selected) can contain unicode
    > > escapes.

    >
    > Simply by treating all conversions to wchar_t as targeting a common
    > representation (Unicode).


    You mean, rewriting all those locale modules so that wchar_t always
    has a consistent value (the unicode code point) for a given character,
    regardless of the way it is encoded in the current module?

    That's effectively what I was saying: those platforms, as they
    currently stand, cannot have locale-independent unicode literals, so
    they have to be modified.

    But actually, I'm not quite sure if C1X really requires unicode
    literals to be locale-independent. The text on character constants,
    string literals and universal character names is really confusing, and
    there's talk about "an implementation-dependent current locale", so it
    might be that even C1X allows the meaning of wide string literals to
    vary between locales. It'd be a shame if this is true.


    Lauri
    Lauri Alanko, Nov 22, 2011
    #9
  10. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-22, Lauri Alanko <> wrote:
    > For instance,
    > the character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
    > is represented by the wchar_t value 0xb0a6.


    Ah, if that's the case, that is pretty broken. \xb0\xa6 should decode into the
    value 0x611B. They ducked out of doing it right, didn't they. Perhaps this
    preserves the behavior of legacy programs which expect the mapping
    to work that way.

    >> Simply by treating all conversions to wchar_t as targeting a common
    >> representation (Unicode).

    >
    > You mean, rewriting all those locale modules so that wchar_t always
    > has a consistent value (the unicode code point) for a given character,


    Or not using/supporting the locales that don't produce Unicode code points.
    You can treat those as weird legacy cruft, like EBCDIC. Find out what works,
    and document that as being supported. "This is a Unicode program, whose
    embedded strings are in Unicode, and which requires a Unicode-compatible
    locale."

    Either way, you don't have to throw out the wchar_t. It is handy because it's
    supported in the form of string literals, and some useful functions like
    wcsspn, wcschr, etc.
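
    For instance, the familiar idioms carry over directly (a trivial
    sketch):

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            const wchar_t *s = L"hello, world";
            const wchar_t *comma = wcschr(s, L',');  /* first L',' -> index 5 */
            size_t span = wcsspn(s, L"ehlo");        /* leading run: 5 */
            printf("%d %u\n", (int) (comma - s), (unsigned) span);
            return 0;
        }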

    I think you have to regard these two problems as being completely separate:

    - writing software that is multilingual.
    - targeting two or more incompatible ways of being multilingual,
      simultaneously in the same program (incompatible meaning that the
      internal representation for characters follows a different map).

    I think you're taking too much into your scope: you want to solve both
    problems, and so then when you look at this FreeBSD mess, it looks
    intractable.

    Solve the first problem, and forget the second.

    The only people needing to solve the second problem are those who
    are saddled with legacy support requirements, like having to continue
    being able to read data from 20 year old versions of the software.
    Kaz Kylheku, Nov 22, 2011
    #10
  11. Lauri Alanko

    Lauri Alanko Guest

    In article <>,
    Kaz Kylheku <> wrote:
    > On 2011-11-22, Lauri Alanko <> wrote:
    > > For instance,
    > > the character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
    > > is represented by the wchar_t value 0xb0a6.

    >
    > Ah, if that's the case, that is pretty broken. \xb0\xa6 should decode into the
    > value 0x611B.


    It should, if __STDC_ISO_10646__ were defined. The standard doesn't
    require it to be.

    > They ducked out of doing it right, didn't they.


    BSD predates Unicode. I can quite well understand that old locale code
    doesn't map characters into their unicode values. What I didn't expect
    is that there would be no attempt to keep the non-standard mappings
    consistent between locales.

    > Or not using/supporting the locales that don't produce Unicode code points.
    > You can treat those as weird legacy cruft, like EBCDIC.


    I have worked on an EBCDIC platform. They are real.

    > Find out what works,
    > and document that as being supported. "This is a Unicode program, whose
    > embedded strings are in Unicode, and which requires a Unicode-compatible
    > locale."


    As I was saying, wchar_t is useless in portable C programming. You
    seem to be concurring, although in a roundabout way.

    > Either way, you don't have to throw out the wchar_t. It is handy because it's
    > supported in the form of string literals, and some useful functions like
    > wcsspn, wcschr, etc.


    None of those are particularly useful for me. I wouldn't be using
    wchar_t* for data storage or processing anyway, just for interchange.

    And again, I don't see a single code point as being a very meaningful
    unit of text. If you need to search for a piece of text, you most
    likely need to search for a substring.

    The only use wchar_t could have had for me was if it had been an
    established, well-defined way of representing multilingual text. As it
    stands, it is far too underdefined, so it has no real use for me.

    > I think you have to regard these two problems as being completely separate:
    >
    > - writing software that is multilingual.
    > - targeting two or more incompatible ways of being multilingual,
    >   simultaneously in the same program (incompatible meaning that the
    >   internal representation for characters follows a different map).
    >
    > I think you're taking too much into your scope: you want to solve both
    > problems, and so then when you look at this FreeBSD mess, it looks
    > intractable.
    >
    > Solve the first problem, and forget the second.


    That's what I'm doing, and that's why I'm going to forget about wchar_t.


    Lauri
    Lauri Alanko, Nov 22, 2011
    #11
  12. Dann Corbit

    Dann Corbit Guest

    In article <>, says...
    >
    > On 2011-11-21, Lauri Alanko <> wrote:

    [snip]
    >
    > > So screw it all, I'll just use UTF-32 like I should have from the
    > > beginning.



    Or you could use this, which is what every sensible person does:
    http://www-01.ibm.com/software/globalization/icu/
    Dann Corbit, Nov 22, 2011
    #12
  13. Joe keane

    Joe keane Guest

    No one should work on I18N unless they're Finnish, Hungarian, or Japanese.
    Joe keane, Nov 24, 2011
    #13
  14. Kaz Kylheku

    Kaz Kylheku Guest

    On 2011-11-24, Joe keane <> wrote:
    > No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


    It's no longer realistic to write programs that do any text processing
    but handle only 8-bit text, even if those programs are not actually
    multilingual.

    (I mean programs for the world to use, not just for use by the author and maybe
    a few of his ISO-latin-character-using colleagues.)
    Kaz Kylheku, Nov 24, 2011
    #14
  15. Keith Thompson

    Keith Thompson Guest

    (Joe keane) writes:
    > No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


    Do you mean that speakers of those languages have additional insight
    into internationalization issues? If so, you have a point, but
    it's vastly overstated (deliberately so for effect, I presume).

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Nov 24, 2011
    #15
  16. James Kuyper

    James Kuyper Guest

    On 11/24/2011 04:16 PM, Kaz Kylheku wrote:
    > On 2011-11-24, Joe keane <> wrote:
    >> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.

    >
    > It's no longer realistic to write programs that do any text processing
    > but handle only 8-bit text, even if those programs are not actually multilingual.
    >
    > (I mean programs for the world to use, not just for use by the author and maybe
    > a few of his ISO-latin-character-using colleagues.)


    Not even if those "few colleagues" number in the hundreds of millions?
    I'm in favor of I18N, but I also work in one of the largest markets in
    the world where it's quite feasible to make a decent profit on software
    that has support for only one language.
    --
    James Kuyper
    James Kuyper, Nov 24, 2011
    #16
  17. 88888 Dihedral

    88888 Dihedral Guest

    On Monday, November 21, 2011 10:14:35 PM UTC+8, Lauri Alanko wrote:
    > I have recently written a number of posts regarding C's wide character
    > support. It now turns out that my investigation has been in vain:
    > wchar_t is useless in portable C programming, although I'm not quite
    > sure whether the standard or implementations are to blame for this. Most
    > likely both: the standard has sanctioned the implementations'
    > deficiencies.
    <snip>


    C strings are too slow for many applications nowadays, because the
    default standard representation is terminator-marked rather than
    length-tagged. That was good for teaching pointers, and it opened the
    door to writing assembly and C together on many platforms to support
    a C compiler a long, long time ago!
    88888 Dihedral, Nov 25, 2011
    #17
  18. Rui Maciel

    Rui Maciel Guest

    Joe keane wrote:

    > No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


    Why is that?


    Rui Maciel
    Rui Maciel, Nov 25, 2011
    #18
  19. Jorgen Grahn

    Jorgen Grahn Guest

    On Thu, 2011-11-24, Keith Thompson wrote:
    > (Joe keane) writes:
    >> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.

    >
    > Do you mean that speakers of those languages have additional insight
    > into internationalization issues? If so, you have a point, but
    > it's vastly overstated (deliberately so for effect, I presume).


    AFAICT, the only thing special about Finland compared to the rest of
    the Latin-1 world is that Finnish has very long words.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Nov 27, 2011
    #19