Dealing with string encodings

Stephan Rose · Dec 2, 2007

Question everyone,

I may be slightly off-topic with this but I'm not really sure where else
to go with this.

What's the "best/easiest" way to deal with string encodings?

Right now, I'm using wstring for all my string operations outside the GUI,
and this works really well. I'm also using it for file I/O.

The problem is this, though: wstring is built on wchar_t, which in itself
isn't a problem, except that wchar_t is 32-bit under Linux and 16-bit under
Windows.

That difference right there is going to break my file IO code as I target
both platforms.

utf-8 is another option but I don't terribly like it much due to the fact
that each character can have a different width in bytes.

That is one of the things I like the most about 32-bit wchar_t. No matter
what, each character will always be *one* wchar_t. Makes indexing into a
string simple and painless.

But still, with one platform/compiler (not sure on which level the
difference is) having wchar_t 32-bit and the other 16-bit, this is slowly
turning into a nightmare.

Any suggestions would be very appreciated,

Thanks in advance.

--
Stephan
2003 Yamaha R6

君の事思い出す日なんてないのは
君の事忘れたときがないから

Alf P. Steinbach · Dec 3, 2007

* Stephan Rose:

> utf-8 is another option but I don't terribly like it much due to the fact
> that each character can have a different width in bytes.
>
> That is one of the things I like the most about 32-bit wchar_t. No matter
> what, each character will always be *one* wchar_t. Makes indexing into a
> string simple and painless.

Whether each character will always be 1 wchar_t depends on the
characters you support.

Essentially, since you're using wchar_t in Windows with that assumption,
you have limited your program to the Basic Multilingual Plane, which with
that encoding is known as UCS-2 (the original 16-bit Unicode).

That's probably not a problem, but not understanding it may be.
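
To make the BMP limitation concrete, here is a minimal sketch (C++11 or later) of how a single code point is encoded as UTF-16. The helper name `to_utf16` is made up for this illustration. Code points inside the BMP fit in one 16-bit unit; anything above U+FFFF needs two units (a surrogate pair), which is where the "one wchar_t per character" assumption stops holding with a 16-bit wchar_t.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::u16string to_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split into a high and a low surrogate.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For example, U+00E9 yields one unit, while U+1D11E (a musical symbol outside the BMP) yields the two units 0xD834 0xDD1E, so indexing by unit no longer lands on character boundaries.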

> But still, with one platform/compiler (not sure on which level the
> difference is) having wchar_t 32-bit and the other 16-bit, this is slowly
> turning into a nightmare.

Use UTF-8 for i/o.

That also solves a number of interoperability problems for the C++
standard library (exceptions, file names).

Pragmatically you can use wchar_t internally unless you're really doing
Unicode (Chinese etc.) in which case you need to do real Unicode.
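
As a rough illustration of "fixed width internally, UTF-8 on output", the sketch below encodes a UTF-32 string as UTF-8 bytes. It assumes C++11 and uses std::u32string rather than wstring so that the width is the same on every platform; the name `to_utf8` is hypothetical, and a real program would more likely lean on an existing conversion library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode a UTF-32 string as UTF-8 bytes.
// Assumes every element is a valid Unicode scalar value.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            // 1 byte: plain ASCII
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: rest of the BMP
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: supplementary planes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The resulting bytes are identical on Linux and Windows, so a file written on one platform reads back the same way on the other, independent of sizeof(wchar_t).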

Cheers, & hth.,

- Alf

James Kanze · Dec 3, 2007

As Alf says (more or less), decide on one "universal" encoding
to use internally, and transcode during I/O. The "best"
internal encoding will depend on what you are doing---and on
what any third party libraries you are using support.

There's no rule which requires the same encoding in files as in
the program itself. For a number of reasons, it's probably a
very bad policy to ever write anything which isn't byte encoded
to a file. Practical considerations will probably also lead you
to support several different file encodings (especially if
you are working on different platforms).
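
The input half of "transcode during I/O" might look like the sketch below, which decodes UTF-8 bytes into an internal UTF-32 string. The function name `from_utf8` and the choice of std::runtime_error for bad input are assumptions made for this example. For brevity it does not reject overlong forms or encoded surrogates, which a production decoder should.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws on truncated or malformed sequences. It does NOT reject
// overlong forms or encoded surrogates.
std::u32string from_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        int extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= static_cast<std::size_t>(extra))
            throw std::runtime_error("truncated UTF-8 sequence");

        for (int k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}
```

Supporting a second file encoding would then mean adding another decoder with the same return type, leaving the rest of the program untouched.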

> Whether each character will always be 1 wchar_t depends on the
> characters you support.
>
> Essentially, since you're using wchar_t in Windows with that
> assumption you have limited your program to Basic Multilingual
> Plane, which with that encoding is known as UCS2 (the original
> 16-bit Unicode).
>
> That's probably not a problem, but not understanding it may be.

It's even more complex than that. Unless you can impose some
normalization form of Unicode, you'll have to deal with the fact
that the same "character" can have several different encodings.
Regardless of the normalization form, some characters will
require several code points---with NFKC, such characters are
rare, however, and if you're only dealing with the major
languages, you can probably ignore them. (Do you really have to
consider a q with a circumflex accent as a single character?)
But you must ensure normalization; otherwise, you have to deal
with compatibility encodings. And unless you ensure NFKC, you
have to deal with many everyday characters (in major European
languages, at least) being represented by two code points.
(E.g. the character "latin small letter e with acute"---very frequent
in French, for example---may be represented as a single code
point, "\u00E9", or as the two code point sequence
"\u0065\u0301". Normalization forms NFC and NFKC require the
former, normalization forms NFD and NFKD the latter.)
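
The "e with acute" example can be checked directly. The sketch below (C++11) holds both spellings as UTF-32 strings and shows that they differ in length and compare unequal. The function `compose_e_acute` is a toy written for this illustration: it handles only this one pair, whereas real normalization needs the full Unicode composition data, normally obtained through a library.

```cpp
#include <cassert>
#include <string>

// The same visible character written two ways.
const std::u32string precomposed = U"\u00E9";   // NFC form: 1 code point
const std::u32string decomposed  = U"e\u0301";  // NFD form: U+0065 U+0301

// Toy composer: handles ONLY 'e' followed by U+0301 (combining acute).
// Not a general normalizer; everything else is copied through unchanged.
std::u32string compose_e_acute(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x0301 && !out.empty() && out.back() == U'e') {
            out.back() = 0x00E9;  // merge base letter and accent
        } else {
            out.push_back(cp);
        }
    }
    return out;
}
```

Without a normalization step, a plain `==` or a search treats the two spellings as different text, even with a 32-bit character type.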

Of course, if you start having to do things like case-insensitive
sorting, the problems become even more complex; in
France, the character 'ö' (latin small letter o with diaeresis)
is sorted as if it were an 'o'; in Germany, according to the DIN 5007
phone-book ordering, as if it were the two character sequence "oe".
So you end up being locale dependent as well.
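
A small sketch of why the ordering depends on the convention in force: the code below builds a sort key under each of the two rules described above and sorts by that key. The names `Rule`, `sort_key` and `sorted` are hypothetical, only U+00F6 is handled, and the sample words are arbitrary. Real collation would go through locale support or a Unicode collation library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names for the two conventions described in the text.
enum class Rule { OAsO, OAsOE };

// Toy sort key: rewrites only U+00F6; every other code point is copied.
std::u32string sort_key(const std::u32string& s, Rule rule) {
    std::u32string key;
    for (char32_t cp : s) {
        if (cp == 0x00F6) {
            key += (rule == Rule::OAsOE) ? U"oe" : U"o";
        } else {
            key.push_back(cp);
        }
    }
    return key;
}

// Returns a copy of words, ordered by their sort keys under the given rule.
std::vector<std::u32string> sorted(std::vector<std::u32string> words, Rule rule) {
    std::sort(words.begin(), words.end(),
              [rule](const std::u32string& a, const std::u32string& b) {
                  return sort_key(a, rule) < sort_key(b, rule);
              });
    return words;
}
```

With the words "Göbel" and "Godard", the first rule puts "Göbel" first (its key is "Gobel"), while the second puts "Godard" first (the key becomes "Goebel"). Same data, different order.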

> Use UTF-8 for i/o.

If you have a choice. If you have to read files written by
other programs, or legacy files, use whatever encoding they use.
(UTF-8 is definitely the way to go, but not everyone is there
yet.)

> That also solves a number of interoperability problems for the
> C++ standard library (exceptions, file names).
>
> Pragmatically you can use wchar_t internally unless you're
> really doing Unicode (Chinese etc.) in which case you need to
> do real Unicode.

I've actually found using UTF-8 internally to be easier than
using UTF-16 or UTF-32. In my case, I need a lot of sparse
tables and bitmaps; UTF-8 lends itself naturally to a trie, with
most of the branches empty. (Of course, you can also use
multilevel tables with UTF-16 or UTF-32.)
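
One way to read the trie remark is a byte-wise trie keyed on UTF-8 encoded characters, where each level branches on one byte and absent branches cost nothing. The sketch below is an assumption about the general shape of such a structure, not the tables actually used; the class name `Utf8Trie` is made up, and std::map keeps the example short at the expense of speed.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical byte-wise trie storing a set of UTF-8 encoded keys
// (single characters or whole strings).
class Utf8Trie {
    struct Node {
        bool terminal = false;  // true if a key ends at this node
        std::map<unsigned char, std::unique_ptr<Node>> children;
    };
    Node root_;

public:
    void insert(const std::string& utf8) {
        Node* n = &root_;
        for (unsigned char b : utf8) {
            std::unique_ptr<Node>& child = n->children[b];
            if (!child) child.reset(new Node);  // create branch on demand
            n = child.get();
        }
        n->terminal = true;
    }

    bool contains(const std::string& utf8) const {
        const Node* n = &root_;
        for (unsigned char b : utf8) {
            auto it = n->children.find(b);
            if (it == n->children.end()) return false;  // empty branch
            n = it->second.get();
        }
        return n->terminal;
    }
};
```

Inserting the UTF-8 bytes of U+00E9 ("\xC3\xA9") creates only the 0xC3 branch and one child under it; every other lead byte stays absent, which is the sparseness being described.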

Stephan Rose · Dec 3, 2007

On Mon, 03 Dec 2007 01:00:25 -0800, James Kanze wrote:

<snip>

Thanks to both of you for your information. This is my first time
doing anything with Unicode, so this is all new territory for me. You've
helped a lot.

--
Stephan
2003 Yamaha R6

君の事思い出す日なんてないのは
君の事忘れたときがないから
