Converting between Unicode and default locale


Gianni Mariani

Keith said:
Well, my question has certainly generated a lot of responses, but not the
kind I was hoping for. Clearly, I was being completely naive to expect the
standard library to include this facility, but I am completely disheartened
not to have found a single working example of how to do code conversion in
streams using 3rd party libraries, such as iconv. Presumably this is
because nobody does it that way.

[RANT] It seems crazy that after a decade of Unicode use, C++ still requires
everyone to reinvent the wheel and do it their own way. I think that the
standards committee is being too precious about this. I know that Unicode
is a moving target, but UCS-2 would suffice for 95% of my requirements - and
100% for those who don't know the difference between it and UTF-16. After
all, the C++ char type doesn't even support the full British English
character set (never mind those of the rest of Europe), without using
non-standard compiler options to make char unsigned. Please, anything is
better than nothing! [/RANT]

Your RANT is mostly justified.

However, there are a number of libraries that provide the support you
are asking for.

If you have the energy to propose a revision to the C++ standard, then do
so, but it's a very complex problem to get right. As for UCS-2-only
support, you would probably not get anyone on the standards committee
to agree to that.
 

Ron Natalie

Gianni Mariani said:
However, there are a number of libraries that provide the support you
are asking for.
The problem is that you can't even implement this without redefining/extending
the C++ standard library classes. The problem is that wchar_t is incompletely
supported in the C++ library, so even if you were to fix up everything in your
implementation, you'd still have to add non-conforming extensions.
 

Gianni Mariani

Ron said:
The problem is that you can't even implement this without redefining/extending
the C++ standard library classes. The problem is that wchar_t is incompletely
supported in the C++ library, so even if you were to fix up everything in your
implementation, you'd still have to add non-conforming extensions.

An option is not to use wchar_t at all. Stick to multibyte. Perform all
the processing in utf-8 multibyte (you need to make sure you provide
support to convert any incoming strings to utf-8).
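
As a rough illustration of what "convert incoming strings to utf-8" can
look like, here is a minimal sketch for one source encoding only,
ISO-8859-1 (Latin-1). The function names append_utf8 and latin1_to_utf8
are made up for this example and are not from any library; a real
program would more likely hand the job to iconv or a similar library,
which covers many more encodings.

```cpp
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Returns false (and appends nothing) for surrogate values and for
// anything above U+10FFFF, which are not valid code points to encode.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Convert ISO-8859-1 bytes to UTF-8. Every Latin-1 byte value equals the
// Unicode code point of the same number, so no lookup table is needed.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);            // worst case: 2 bytes per input byte
    for (unsigned char byte : in) {
        append_utf8(byte, out);            // cannot fail for values 0..255
    }
    return out;
}
```

With this, the Latin-1 pound sign (byte 0xA3) becomes the two bytes
0xC2 0xA3, while plain ASCII input passes through unchanged.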

Even for UTF-32 you need to deal with multi-"unit" issues because of
combining characters. I don't remember specifically what the 10646
standard says but processing text with combining characters has many of
the same restrictions as multibyte characters (keeping them together).
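
To make the "keeping them together" point concrete, here is a small
sketch that groups a UTF-32 string into units that should not be split.
It is deliberately simplified: it assumes that only the Combining
Diacritical Marks block (U+0300 to U+036F) counts as combining, which is
far from the full Unicode segmentation rules. The names
is_combining_mark and split_clusters are invented for this example.

```cpp
#include <string>
#include <vector>

// Assumption for this sketch: only U+0300..U+036F is treated as combining.
// Real text needs the Unicode character database to decide this.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split text into units that must not be separated: a base character
// followed by any combining marks attached to it.
std::vector<std::u32string> split_clusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : text) {
        if (!clusters.empty() && is_combining_mark(cp)) {
            clusters.back() += cp;                       // stays with its base
        } else {
            clusters.push_back(std::u32string(1, cp));   // starts a new unit
        }
    }
    return clusters;
}
```

For the three code points 'e', U+0301 (combining acute accent), 'x',
this yields two units, so indexing by UTF-32 element alone would still
cut the accented letter in half.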

Processing utf-16 or utf-32, you have issues with endianness or managing
the byte-order-mark which makes it a stateful encoding. This breaks a
whole bunch of subtle assumptions about the indexability of files. No
such problem exists with utf-8.
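
A sketch of why the byte-order-mark makes this stateful: the reader has
to inspect the start of the data first, and then carry the result along
to interpret every later code unit. The enum Bom and the functions
detect_bom and read_utf16_unit below are names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

enum class Bom { None, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Look at the first bytes of a buffer and report which UTF-16/UTF-32
// byte-order-mark, if any, it starts with.
Bom detect_bom(const unsigned char* data, std::size_t size) {
    // UTF-32 must be tested before UTF-16, because FF FE 00 00 begins
    // with the UTF-16 little-endian mark FF FE. (That sequence could also
    // be UTF-16LE text starting with U+0000; this sketch ignores that.)
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) return Bom::Utf32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) return Bom::Utf32BE;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return Bom::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}

// Read one UTF-16 code unit from two bytes. The result depends on the
// byte order found earlier; anything other than Utf16LE is read as
// big-endian here.
std::uint16_t read_utf16_unit(const unsigned char* p, Bom order) {
    if (order == Bom::Utf16LE) {
        return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    }
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

The same two bytes 41 00 decode to U+0041 under one byte order and
U+4100 under the other, so seeking into the middle of such a file is
only meaningful if you already know what the start of the file said.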

It just makes a whole lot of sense to use utf-8 everywhere when possible.