multibyte characters

TK

Hi,

how can I handle multibyte characters like ä, ü (german vowel mutation)?

This doesn't work:

switch (c)
{
case 'ä':
    // ... some action
    break;
case 'ü':
    // ... some action
    break;
// ...
}

Thanks for help.

o-o

Thomas
 
Ben Bacarisse

TK said:
how can I handle multibyte characters like ä, ü (german vowel mutation)?

This doesn't work:

switch (c)
{
case 'ä':
    // ... some action
    break;
case 'ü':
    // ... some action
    break;
// ...
}

Use

wchar_t c;

with L'ä' or L'\u00e4' as the case labels. Using UCNs (the \uxxxx
syntax) is probably more portable.
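
A minimal sketch of the wide-character approach, for illustration only. The function name baseVowel is hypothetical, and the code assumes the compiler's wide execution character set can represent U+00E4, U+00F6 and U+00FC, which the standard does not guarantee for wchar_t:

```cpp
// Hypothetical helper: maps a German umlaut to its base vowel.
// Each case label is a single wchar_t value, so switch works here.
wchar_t baseVowel(wchar_t c)
{
    switch (c) {
    case L'\u00e4': return L'a';   // a-umlaut, written as a UCN
    case L'\u00f6': return L'o';   // o-umlaut
    case L'\u00fc': return L'u';   // u-umlaut
    default:        return c;      // everything else is passed through
    }
}
```

Writing the labels as UCNs keeps the source file pure ASCII, so the result does not depend on how the editor saved the file. The input still has to be converted to wide characters somewhere before this function sees it.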
 
Gianni Mariani

Ben said:
wchar_t c;

with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.


This all depends on which character encoding is being used. wchar_t is
not necessarily a Unicode character.
 
Ron Natalie

Gianni said:
This all depends on which character encoding is being used. wchar_t is
not necessarily a Unicode character.

And L'...' doesn't generate a MULTIBYTE character. It makes a wide
character.
 
James Kanze

Ben said:
wchar_t c;
with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.

It depends (and his question is opening a can of worms). If
he's not interested in internationalization (the program will
only be used in German-speaking areas), then using wide
characters is overkill. Maybe. Independently of the question
wchar_t vs. char, the very first question is what encoding he is
using at execution time, and what encoding the compiler supposes
he is using. If, for example, he is using ISO 8859-1
everywhere, exactly what he has written might actually work---it
works with all the compilers I have here at work (where
everything is ISO 8859-1): g++, Sun CC and VC++. It probably
won't work on my Unix system at home, because there I use UTF-8.
If his environment uses UTF-8 anywhere, he'll have to find a
different solution: in UTF-8, 'ä' is a multi-byte character
(0xC3, 0xA4).
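
To make the difference concrete, here is a small sketch that spells the same character in both encodings with explicit byte escapes, so it does not depend on the encoding of the source file. The names aUmlautLatin1, aUmlautUtf8 and isLatin1AUmlaut are invented for this example:

```cpp
#include <string>

// The same character, a-umlaut (U+00E4), in two encodings.
const std::string aUmlautLatin1 = "\xE4";       // ISO 8859-1: one byte
const std::string aUmlautUtf8   = "\xC3\xA4";   // UTF-8: two bytes

// Hypothetical helper: true if the single byte is a-umlaut in ISO 8859-1.
// Going through unsigned char avoids surprises where plain char is signed.
bool isLatin1AUmlaut(char c)
{
    return static_cast<unsigned char>(c) == 0xE4;
}
```

With ISO 8859-1 the character fits in one char, so a case label can match it. With UTF-8 no single char value is the character, which is why the original switch cannot work there.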

The solution he should probably adopt depends a lot on context.
If he can get away with only the characters in ISO 8859-1 (which
is sufficient for German---but he might have to handle proper
names with other characters in them), it's definitely easier to
code. If in addition he can configure his editor so that it
also writes all files in ISO 8859-1 (":set fileencoding=latin1"
in vim), and he is using one of the compilers I use (Sun CC, g++
or VC++), then he can even write the Umlauts in his source code
(but IMHO, that's pushing things a bit---I'd just use 0xE4,
etc.). If he has to deal with other characters, or with files
which might use other encodings, the problem becomes more
difficult. I usually use UTF-8, even internally, but which
encoding format to choose depends somewhat on what you are doing
with the text, and probably to some degree on the compiler as
well: for some jobs, you'll absolutely want UTF-32 (which means
using int32_t, and not wchar_t). Of course, if he's using
UTF-8, something like the above would have to be written using
an if/else chain, and not as a switch. If this only occurs
once, and there are only three or four cases in the switch, it's
no big deal; if it occurs in a lot of places, that's probably a
sign that UTF-8 is not the correct choice for your application.
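
The if/else chain mentioned above could look roughly like the following sketch. It assumes the text is held in a std::string as UTF-8; startsWithAt and classifyUmlaut are hypothetical names, not part of any library:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: does `text` contain the byte sequence `seq`
// starting at offset `pos`? An offset past the end never matches.
bool startsWithAt(const std::string& text, std::size_t pos,
                  const std::string& seq)
{
    return pos <= text.size() && text.compare(pos, seq.size(), seq) == 0;
}

// Classifies the UTF-8 character starting at `pos`.
// Returns 'a' or 'u' for the two umlauts handled here, '\0' otherwise.
char classifyUmlaut(const std::string& text, std::size_t pos)
{
    if (startsWithAt(text, pos, "\xC3\xA4")) {          // U+00E4, a-umlaut
        return 'a';
    } else if (startsWithAt(text, pos, "\xC3\xBC")) {   // U+00FC, u-umlaut
        return 'u';
    } else {
        return '\0';
    }
}
```

Each branch compares a byte sequence rather than a single value, which is what rules out a switch. The caller also has to advance by the length of the matched sequence, not by one byte.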

Regardless of the solution chosen, you have to consider four
encodings: that in the files you are reading and writing, that
which you use internally, that which the compiler assumes you
are using, and if you use the umlauted characters in your
source, that which the compiler uses to read your sources. Note
that L'\u00E4' isn't a panacea either. The compiler will
translate it into the a-Umlaut in whatever encoding it thinks
you are using internally. If the encoding it thinks you are
using is the one you are actually using, fine. If not,
however... If you know that you want to use Unicode, UTF-32
format, for example, your only portable solution is something
like:

typedef uint32_t UTF32Char ;
UTF32Char const aUmlaut = 0x00E4 ;
// ...

Of course, if you do this, you'll probably have to reimplement
large parts of iostream and locale as well.
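
As a sketch of where this leads: once characters are held as UTF-32 code points defined by numeric constants, the original switch becomes possible again. The constants oUmlaut and uUmlaut and the function baseLetter are additions for illustration, and the code assumes the input has already been decoded to UTF-32:

```cpp
#include <cstdint>

typedef std::uint32_t UTF32Char;

// Code points given numerically, so neither the source encoding nor the
// compiler's idea of the execution character set is involved.
UTF32Char const aUmlaut = 0x00E4;
UTF32Char const oUmlaut = 0x00F6;
UTF32Char const uUmlaut = 0x00FC;

// Hypothetical helper: maps an umlaut code point to its base letter.
UTF32Char baseLetter(UTF32Char c)
{
    switch (c) {
    case aUmlaut: return 0x0061;   // 'a'
    case oUmlaut: return 0x006F;   // 'o'
    case uUmlaut: return 0x0075;   // 'u'
    default:      return c;        // other code points are unchanged
    }
}
```

The constants are usable as case labels because they are const integers initialized with constant expressions. Getting text into and out of this form is the part that iostream and locale will not do for you.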
 
Pete Becker

James said:
typedef uint32_t UTF32Char ;

In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.
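
For readers with a compiler that implements this (the types shipped in C++11 as char32_t and std::u32string), a short sketch of what it allows. The function countUmlauts is a made-up example, not something from the standard library:

```cpp
#include <cstddef>
#include <string>

// Counts the lowercase German umlauts in UTF-32 text.
// char32_t is a distinct built-in type; U'...' is its literal form.
std::size_t countUmlauts(const std::u32string& text)
{
    std::size_t n = 0;
    for (char32_t c : text) {
        switch (c) {
        case U'\u00e4':   // a-umlaut
        case U'\u00f6':   // o-umlaut
        case U'\u00fc':   // u-umlaut
            ++n;
            break;
        default:
            break;
        }
    }
    return n;
}
```

For example, countUmlauts(U"M\u00e4dchen") returns 1. Reading UTF-8 or ISO 8859-1 input into a std::u32string still needs a conversion step, which this sketch leaves out.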
 
Andreas Dehmel

Pete said:
In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.

Let's hope the next standard will also provide comprehensive transcoding
functionality between arbitrary encodings -- as part of the standard (lib)
I mean, not as part of another library -- because without that any string
types/classes it defines will be almost completely useless. And let's
also hope the new file-IO interface will understand these classes as well,
otherwise ditto.



Andreas
 
Pete Becker

Andreas said:
Let's hope the next standard will also provide comprehensive transcoding
functionality between arbitrary encodings -- as part of the standard (lib)
I mean, not as part of another library -- because without that any string
types/classes it defines will be almost completely useless. And let's
also hope the new file-IO interface will understand these classes as well,
otherwise ditto.

There's no new file-IO interface under discussion. As for the current
one, basic_fstream, it already deals with codecvt facets, and that's
the mechanism for translating between character encodings. There are
some new convenience classes for common conversions (see
www.versatilecoding.com for a quick overview), and there will be
builtin codecvt facets for a few common conversions in support of
Unicode.
 
Andreas Dehmel

Pete said:
There's no new file-IO interface under discussion. As for the current
one, basic_fstream, it already deals with codecvt facets, and that's
the mechanism for translating between character encodings. There are
some new convenience classes for common conversions (see
www.versatilecoding.com for a quick overview), and there will be
builtin codecvt facets for a few common conversions in support of
Unicode.

You appear to be talking about the contents of files. As far as file-IO
is concerned I was talking about the names of files. The days when
filenames could be assumed to be US-ASCII or at least an 8-bit encoding
like the ISO-8859-* family are ancient history and ATM support for this
sort of thing is practically non-existent as far as the C/C++ standard
libs are concerned.



Andreas
 
Pete Becker

Andreas said:
You appear to be talking about the contents of files. As far as file-IO
is concerned I was talking about the names of files. The days when
filenames could be assumed to be US-ASCII or at least an 8-bit encoding
like the ISO-8859-* family are ancient history and ATM support for this
sort of thing is practically non-existent as far as the C/C++ standard
libs are concerned.

Wide-character file names were added to the specification for C++0x a
year or so ago. Might I suggest that you look at the draft standard
before complaining about what's not in it? The current draft is
available at
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2461.pdf.
 
James Kanze

Pete said, in reply to James Kanze's post of 2007-11-16 05:50:47 -0500:
In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.

So I've heard.

I've not been following too closely: will the implementation
also be required to provide the appropriate specializations for
basic_iostream et al, the facets, and basic_string?

And will these types be "conditional" (only available if the
implementation decides to support the encoding), or required on
all implementations?
 
Pete Becker

James said:
So I've heard.

I've not been following too closely: will the implementation
also be required to provide the appropriate specializations for
basic_iostream et al, the facets, and basic_string?

And will these types be "conditional" (only available if the
implementation decides to support the encoding), or required on
all implementations?

Support for UTF-8, UTF-16, and UTF-32 is required, at the level of
having those typedefs, having the appropriate specializations for
basic_string, and a handful of other things to support conversions.
There's a brief sketch at www.versatilecoding.com.
 
Pete Becker

Pete said:
Support for UTF-8, UTF-16, and UTF-32 is required, at the level of
having those typedefs, having the appropriate specializations for
basic_string, and a handful of other things to support conversions.
There's a brief sketch at www.versatilecoding.com.

Whoops, char16_t and char32_t will be builtin types, not typedefs.
 
