multibyte characters

TK · Nov 15, 2007

Hi,

how can I handle multibyte characters like ä, ü (German umlauts)?

This doesn't work:

switch(c)

case 'ä':
... some action
break;
case 'ü':
... some action
break;
....
....

Thanks for help.

o-o

Thomas

Ben Bacarisse · Nov 15, 2007

TK said:
how can I handle multibyte characters like ä, ü (German umlauts)?

This doesn't work:

switch(c)

case 'ä':
... some action
break;
case 'ü':
... some action
break;
...
...

wchar_t c;

with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.
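
A minimal sketch of the wide-character approach, assuming the implementation's wide execution character set maps \u00e4 and friends to distinct wchar_t values (true of Unicode-based implementations, but not guaranteed). The function name transliterate and the replacement strings are made up for illustration:

```cpp
#include <string>

// Hypothetical helper: maps a German umlaut to an ASCII transliteration.
// Universal character names (\u00e4 etc.) keep the source file pure ASCII,
// so the result does not depend on how the editor saved the file.
std::wstring transliterate(wchar_t c)
{
    switch (c) {
    case L'\u00e4': return L"ae";              // a-umlaut
    case L'\u00f6': return L"oe";              // o-umlaut
    case L'\u00fc': return L"ue";              // u-umlaut
    default:        return std::wstring(1, c); // everything else unchanged
    }
}
```

The input still has to arrive as wide characters in the same encoding the compiler used for the literals, which is a separate problem from writing the switch.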

Gianni Mariani · Nov 15, 2007

Ben said:
wchar_t c;

with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.

This all depends on which character encoding is being used. wchar_t is
not necessarily a Unicode character.

Ron Natalie · Nov 15, 2007

Gianni said:
This all depends on which character encoding is being used. wchar_t is
not necessarily a Unicode character.

And L'...' doesn't generate a MULTIBYTE character. It makes a wide
character.

James Kanze · Nov 16, 2007

wchar_t c;

with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.

It depends (and his question is opening a can of worms). If
he's not interested in internationalization (the program will
only be used in German-speaking areas), then using wide
characters is overkill. Maybe. Independently of the question of
wchar_t vs. char, the very first question is what encoding he is
using at execution time, and what encoding the compiler supposes
he is using. If, for example, he is using ISO 8859-1
everywhere, exactly what he has written might actually work---it
works with all the compilers I have here at work (where
everything is ISO 8859-1): g++, Sun CC and VC++. It probably
won't work on my Unix system at home, because there I use UTF-8.
If his environment uses UTF-8 anywhere, he'll have to find a
different solution: in UTF-8, 'ä' is a multi-byte character
(0xC3, 0xA4).
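
To make the UTF-8 point concrete, here is a small sketch. The bytes are written as hex escapes so the example does not depend on the source file's encoding; the names aUmlautUtf8 and countCodePoints are invented for illustration, and the counting function assumes well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// The UTF-8 encoding of U+00E4 (a-umlaut): two bytes, so it cannot be
// held in a single char or used as a plain 'x' character literal.
const std::string aUmlautUtf8 = "\xC3\xA4";

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t countCodePoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

Here aUmlautUtf8.size() is 2 while countCodePoints(aUmlautUtf8) is 1, which is exactly why a switch over single char values cannot match it.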

The solution he should probably adopt depends a lot on context.
If he can get away with only the characters in ISO 8859-1 (which
is sufficient for German---but he might have to handle proper
names with other characters in them), it's definitely easier to
code. If in addition he can configure his editor so that it
also writes all files in ISO 8859-1 (":set fileencoding=latin1"
in vim), and he is using one of the compilers I use (Sun CC, g++
or VC++), then he can even write the Umlauts in his source code
(but IMHO, that's pushing things a bit---I'd just use 0xE4,
etc.). If he has to deal with other characters, or with files
which might use other encodings, the problem becomes more
difficult. I usually use UTF-8, even internally, but which
encoding format to choose depends somewhat on what you are doing
with the text, and probably to some degree on the compiler as
well: for some jobs, you'll absolutely want UTF-32 (which means
using int32_t, and not wchar_t). Of course, if he's using
UTF-8, something like the above would have to be written using
an if/else chain, and not as a switch. If this only occurs
once, and there are only three or four cases in the switch, it's
no big deal; if it occurs in a lot of places, that's probably a
sign that UTF-8 is not the correct choice for your application.
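
A rough sketch of what such an if/else chain could look like on UTF-8 data. The helper names startsWithAt and transliterateUtf8 and the replacement strings are hypothetical, and the byte sequences are written as hex escapes to stay independent of the source encoding:

```cpp
#include <cstddef>
#include <string>

// Returns true if the bytes of `seq` occur in `text` starting at `pos`.
// Requires pos <= text.size().
bool startsWithAt(const std::string& text, std::size_t pos,
                  const std::string& seq)
{
    return text.compare(pos, seq.size(), seq) == 0;
}

// Replaces the lowercase German umlauts in UTF-8 text with ASCII digraphs.
// A switch is not possible here because each umlaut is a two-byte sequence.
std::string transliterateUtf8(const std::string& text)
{
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (startsWithAt(text, i, "\xC3\xA4")) {        // a-umlaut
            out += "ae"; i += 2;
        } else if (startsWithAt(text, i, "\xC3\xB6")) { // o-umlaut
            out += "oe"; i += 2;
        } else if (startsWithAt(text, i, "\xC3\xBC")) { // u-umlaut
            out += "ue"; i += 2;
        } else {
            out += text[i]; ++i;                        // copy byte as is
        }
    }
    return out;
}
```

With three cases this is tolerable; with many more, decoding to code points first and switching on those is likely to be cleaner.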

Regardless of the solution chosen, you have to consider four
encodings: that in the files you are reading and writing, that
which you use internally, that which the compiler assumes you
are using, and if you use the umlauted characters in your
source, that which the compiler uses to read your sources. Note
that L'\u00E4' isn't a panacea either. The compiler will
translate it into the a-Umlaut in whatever encoding it thinks
you are using internally. If the encoding it thinks you are
using is the one you are actually using, fine. If not,
however... If you know that you want to use Unicode, UTF-32
format, for example, your only portable solution is something
like:

typedef uint32_t UTF32Char ;
UTF32Char const aUmlaut = 0x00E4 ;
// ...

Of course, if you do this, you'll probably have to reimplement
large parts of iostream and locale as well.
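
As a taste of what that reimplementation involves, here is a minimal sketch of decoding UTF-8 input into the UTF32Char type above. The function name decodeUtf8 is made up, and the code assumes well-formed input: it does not reject overlong forms, truncated sequences or surrogates, which real code would have to do.

```cpp
#include <stdint.h>
#include <cstddef>
#include <string>
#include <vector>

typedef uint32_t UTF32Char;

// Decodes UTF-8 bytes into UTF-32 code points.
// Sketch only: assumes well-formed input and performs no validation.
std::vector<UTF32Char> decodeUtf8(const std::string& in)
{
    std::vector<UTF32Char> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        UTF32Char cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

Once the text is in this form, a switch on constants such as 0x00E4 works again, independently of what the compiler thinks the execution character set is.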

TK · Nov 16, 2007

Thanks for help.

o-o

Thomas

Pete Becker · Nov 16, 2007

typedef uint32_t UTF32Char ;

In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.
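
A sketch of how the original switch might look with these types, assuming a compiler that implements them as they were eventually standardized in C++11 (char32_t, U'...' literals, std::u32string). The function names are hypothetical:

```cpp
#include <cstddef>
#include <string>

// Returns true for the six German umlaut code points.
// U'...' literals have type char32_t and denote UTF-32 code units.
bool isGermanUmlaut(char32_t c)
{
    switch (c) {
    case U'\u00E4': case U'\u00F6': case U'\u00FC':  // lowercase
    case U'\u00C4': case U'\u00D6': case U'\u00DC':  // uppercase
        return true;
    default:
        return false;
    }
}

// Counts the umlauts in a UTF-32 string.
std::size_t countUmlauts(const std::u32string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (isGermanUmlaut(s[i]))
            ++n;
    return n;
}
```

Getting external text into a std::u32string still requires a conversion step from whatever encoding the files use.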

Andreas Dehmel · Nov 16, 2007

In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.

Let's hope the next standard will also provide comprehensive transcoding
functionality between arbitrary encodings -- as part of the standard (lib)
I mean, not as part of another library -- because without that any string
types/classes it defines will be almost completely useless. And let's
also hope the new file-IO interface will understand these classes as well,
otherwise ditto.

Andreas

Pete Becker · Nov 16, 2007

Let's hope the next standard will also provide comprehensive transcoding
functionality between arbitrary encodings -- as part of the standard (lib)
I mean, not as part of another library -- because without that any string
types/classes it defines will be almost completely useless. And let's
also hope the new file-IO interface will understand these classes as well,
otherwise ditto.

There's no new file-IO interface under discussion. As for the current
one, basic_fstream, it already deals with codecvt facets, and that's
the mechanism for translating between character encodings. There are
some new convenience classes for common conversions (see
www.versatilecoding.com for a quick overview), and there will be
builtin codecvt facets for a few common conversions in support of
Unicode.

Andreas Dehmel · Nov 16, 2007

There's no new file-IO interface under discussion. As for the current
one, basic_fstream, it already deals with codecvt facets, and that's
the mechanism for translating between character encodings. There are
some new convenience classes for common conversions (see
www.versatilecoding.com for a quick overview), and there will be
builtin codecvt facets for a few common conversions in support of
Unicode.

You appear to be talking about the contents of files. As far as file-IO
is concerned I was talking about the names of files. The days when
filenames could be assumed to be US-ASCII or at least an 8-bit encoding
like the ISO-8859-* family are ancient history and ATM support for this
sort of thing is practically non-existent as far as the C/C++ standard
libs are concerned.

Andreas

Pete Becker · Nov 16, 2007

You appear to be talking about the contents of files. As far as file-IO
is concerned I was talking about the names of files. The days when
filenames could be assumed to be US-ASCII or at least an 8-bit encoding
like the ISO-8859-* family are ancient history and ATM support for this
sort of thing is practically non-existent as far as the C/C++ standard
libs are concerned.

Wide-character file names were added to the specification for C++0x a
year or so ago. Might I suggest that you look at the draft standard
before complaining about what's not in it? The current draft is
available at
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2461.pdf.

James Kanze · Nov 17, 2007

Pete Becker said:

In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.

So I've heard.

I've not been following too closely: will the implementation
also be required to provide the appropriate specializations for
basic_iostream et al, the facets, and basic_string?

And will these types be "conditional" (only available if the
implementation decides to support the encoding), or required on
all implementations?

Pete Becker · Nov 17, 2007

So I've heard.

I've not been following too closely: will the implementation
also be required to provide the appropriate specializations for
basic_iostream et al, the facets, and basic_string?

And will these types be "conditional" (only available if the
implementation decides to support the encoding), or required on
all implementations?

Support for UTF-8, UTF-16, and UTF-32 is required, at the level of
having those typedefs, having the appropriate specializations for
basic_string, and a handful of other things to support conversions.
There's a brief sketch at www.versatilecoding.com.

Pete Becker · Nov 17, 2007

Support for UTF-8, UTF-16, and UTF-32 is required, at the level of
having those typedefs, having the appropriate specializations for
basic_string, and a handful of other things to support conversions.
There's a brief sketch at www.versatilecoding.com.

Whoops, char16_t and char32_t will be builtin types, not typedefs.
