Dealing with string encodings

Stephan Rose

Question everyone,

I may be slightly off-topic with this but I'm not really sure where else
to go with this.

What's the "best/easiest" way to deal with string encodings?

Right now, I'm using wstring for all my string operations outside the GUI
and this works really well. Also using it for file io.

Problem is this though: It's defined as wchar_t which inherently isn't a
problem, except that wchar_t is 32-bit under linux and 16-bit under
windows.

That difference right there is going to break my file IO code as I target
both platforms.

utf-8 is another option but I don't terribly like it much due to the fact
that each character can have a different width in bytes.

That is one of the things I like the most about 32-bit wchar_t. No matter
what, each character will always be *one* wchar_t. Makes indexing into a
string simple and painless.

But still, with one platform/compiler (not sure on which level the
difference is) having wchar_t 32-bit and the other 16-bit, this is slowly
turning into a nightmare.

Any suggestions would be very appreciated,

Thanks in advance.

--
Stephan
2003 Yamaha R6

君の事思い出す日なんてないのは
君の事忘れたときがないから
 
Alf P. Steinbach

* Stephan Rose:
> Question everyone,
>
> I may be slightly off-topic with this but I'm not really sure where else
> to go with this.
>
> What's the "best/easiest" way to deal with string encodings?
>
> Right now, I'm using wstring for all my string operations outside the GUI
> and this works really well. Also using it for file io.
>
> Problem is this though: It's defined as wchar_t which inherently isn't a
> problem, except that wchar_t is 32-bit under linux and 16-bit under
> windows.
>
> That difference right there is going to break my file IO code as I target
> both platforms.
>
> utf-8 is another option but I don't terribly like it much due to the fact
> that each character can have a different width in bytes.
>
> That is one of the things I like the most about 32-bit wchar_t. No matter
> what, each character will always be *one* wchar_t. Makes indexing into a
> string simple and painless.

Whether each character will always be 1 wchar_t depends on the
characters you support.

Essentially, since you're using wchar_t in Windows with that assumption
you have limited your program to the Basic Multilingual Plane, which with
that encoding is known as UCS-2 (the original 16-bit Unicode).

That's probably not a problem, but not understanding it may be.
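A minimal sketch of what that limitation looks like in code, assuming a C++11 compiler. Here char16_t stands in for the 16-bit wchar_t on Windows, and count_code_points is a hypothetical helper, not a standard function. A character outside the Basic Multilingual Plane takes two 16-bit units (a surrogate pair), so size() no longer equals the number of characters:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string by skipping the
// trailing half of each surrogate pair.
// Assumes well-formed input (no unpaired surrogates).
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        // Low (trailing) surrogates occupy 0xDC00..0xDFFF and never start
        // a character, so they are not counted.
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;
    }
    return n;
}
```

For example, std::u16string(u"\U0001D11E") (a musical G clef) has size() 2, while count_code_points reports one code point for it.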


> But still, with one platform/compiler (not sure on which level the
> difference is) having wchar_t 32-bit and the other 16-bit, this is slowly
> turning into a nightmare.

Use UTF-8 for i/o.

That also solves a number of interoperability problems for the C++
standard library (exceptions, file names).
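As a rough sketch of writing UTF-8 regardless of the platform's wchar_t width, the following encodes UTF-32 code points by hand. The names append_utf8 and to_utf8 are made up for illustration, and replacing invalid values with U+FFFD is an assumed policy; a real program might prefer a library such as ICU or iconv.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to out.
// Returns false (and appends nothing) if cp is not a Unicode scalar value.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + three continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Encode a whole UTF-32 string. Invalid values are replaced by U+FFFD
// (an assumed policy for this sketch).
std::string to_utf8(const std::u32string& s) {
    std::string out;
    for (char32_t cp : s) {
        if (!append_utf8(out, cp)) append_utf8(out, 0xFFFD);
    }
    return out;
}
```

The resulting std::string holds the same bytes on Linux and Windows, so it can be written to a file opened in binary mode without depending on sizeof(wchar_t).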

Pragmatically you can use wchar_t internally unless you're really doing
Unicode (Chinese etc.) in which case you need to do real Unicode.


Cheers, & hth.,

- Alf
 
James Kanze

As Alf says (more or less), decide on one "universal" encoding
to use internally, and transcode during I/O. The "best"
internal encoding will depend on what you are doing---and on
what any third party libraries you are using support.
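A sketch of the input half of "transcode during I/O", assuming UTF-8 files and UTF-32 as the internal encoding (only one of the possible choices). The function name from_utf8 and the policy of replacing malformed input with U+FFFD are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points.
// Malformed sequences, overlong forms, surrogates and values above
// U+10FFFF are replaced by U+FFFD.
std::u32string from_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // continuation bytes expected
        char32_t min = 0;        // smallest value legal for this length
        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else { out += U'\uFFFD'; ++i; continue; }   // stray continuation or invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;      // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        out += cp;
    }
    return out;
}
```

Once decoded, each element of the std::u32string is one code point, which gives the simple indexing asked for in the original question (subject to the combining-character caveats discussed in this thread).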

There's no rule which requires the same encoding in files as in
the program itself. For a number of reasons, it's probably a
very bad policy to ever write anything which isn't byte encoded
to a file. Practical considerations will probably also lead you
to support several different file encodings (especially if
you are working on different platforms).

> Whether each character will always be 1 wchar_t depends on the
> characters you support.

> Essentially, since you're using wchar_t in Windows with that
> assumption you have limited your program to the Basic Multilingual
> Plane, which with that encoding is known as UCS-2 (the original
> 16-bit Unicode).

> That's probably not a problem, but not understanding it may be.

It's even more complex than that. Unless you can impose some
normalization form of Unicode, you'll have to deal with the fact
that the same "character" can have several different encodings.
Regardless of the normalization form, some characters will
require several code points---with NFKC, such characters are
rare, however, and if you're only dealing with the major
languages, you can probably ignore them. (Do you really have to
consider a q with a circumflex accent as a single character?)
But you must ensure normalization; otherwise, you have to deal
with compatibility encodings. And unless you ensure NFKC, you
have to deal with many everyday characters (in major European
languages, at least) being represented by two code points.
(E.g. the character "latin small letter e with acute"---very frequent
in French, for example---may be represented as a single code
point, "\u00E9", or as the two code point sequence
"\u0065\u0301". Normalization forms NFC and NFKC require the
former, normalization forms NFD and NFKD the latter.)
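To make the e-acute example concrete, here is a small sketch. The helper compose_acute is hypothetical and handles only a few letters followed by U+0301; real NFC normalization needs the full Unicode composition data (for example from a library such as ICU), which this does not attempt.

```cpp
#include <string>

// Hypothetical helper: fold "base letter + U+0301 (combining acute)" into
// the precomposed character for a handful of letters only.
// This is NOT full NFC; it only illustrates why normalization matters.
std::u32string compose_acute(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == 0x0301 && !out.empty()) {
            char32_t& prev = out.back();
            switch (prev) {
                case U'a': prev = 0x00E1; continue;  // á
                case U'e': prev = 0x00E9; continue;  // é
                case U'i': prev = 0x00ED; continue;  // í
                case U'o': prev = 0x00F3; continue;  // ó
                case U'u': prev = 0x00FA; continue;  // ú
                default: break;                      // unknown base: keep the accent
            }
        }
        out += c;
    }
    return out;
}
```

Without such a step, U"\u00E9" and U"e\u0301" compare unequal and have lengths 1 and 2, even though both display as the same character.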

Of course, if you start having to do things like case
insensitive sorting, the problems become even more complex; in
France, the character 'ö' (latin small letter o with diaeresis)
is sorted as if it were an 'o'; in Germany, according to DIN, as
if it were the two character sequence "oe". So you end up being
locale dependent as well.
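The two sorting conventions can be sketched with a toy sort key. The functions sort_key and sorted and the expand_umlaut flag are invented for this illustration; the sketch handles only lowercase 'ö', and production code would normally rely on a locale-aware collation facility instead.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sort key: map 'ö' (U+00F6) either to "o" or to "oe".
std::u32string sort_key(const std::u32string& s, bool expand_umlaut) {
    std::u32string key;
    for (char32_t c : s) {
        if (c == 0x00F6) {
            key += U'o';
            if (expand_umlaut) key += U'e';  // the "oe" convention
        } else {
            key += c;
        }
    }
    return key;
}

// Return a copy of words sorted under the chosen convention.
std::vector<std::u32string> sorted(std::vector<std::u32string> words,
                                   bool expand_umlaut) {
    std::sort(words.begin(), words.end(),
              [expand_umlaut](const std::u32string& a,
                              const std::u32string& b) {
                  return sort_key(a, expand_umlaut) <
                         sort_key(b, expand_umlaut);
              });
    return words;
}
```

With the words "of", "öl" and "ob", treating 'ö' as 'o' gives ob, of, öl, while treating it as "oe" gives ob, öl, of.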

> Use UTF-8 for i/o.

If you have a choice. If you have to read files written by
other programs, or legacy files, use whatever encoding they use.
(UTF-8 is definitely the way to go, but not everyone is there
yet.)

> That also solves a number of interoperability problems for the
> C++ standard library (exceptions, file names).

> Pragmatically you can use wchar_t internally unless you're
> really doing Unicode (Chinese etc.) in which case you need to
> do real Unicode.

I've actually found using UTF-8 internally to be easier than
using UTF-16 or UTF-32. In my case, I need a lot of sparse
tables and bitmaps; UTF-8 lends itself naturally to a trie, with
most of the branches empty. (Of course, you can also use
multilevel tables with UTF-16 or UTF-32.)
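One way such a structure might look is a trie keyed on UTF-8 bytes, sketched below as a sparse set of characters (or strings). The class name Utf8Trie and its interface are assumptions made for this example, not the code described in the post; std::map keeps each node sparse, so unused branches cost nothing.

```cpp
#include <map>
#include <memory>
#include <string>

// A sparse set of UTF-8 encoded characters or strings, stored as a trie
// with one level per byte.
class Utf8Trie {
    struct Node {
        bool terminal = false;  // true if a stored sequence ends here
        std::map<unsigned char, std::unique_ptr<Node>> next;
    };
    Node root_;

public:
    void insert(const std::string& utf8) {
        Node* n = &root_;
        for (unsigned char b : utf8) {
            std::unique_ptr<Node>& child = n->next[b];
            if (!child) child.reset(new Node);  // create the branch on demand
            n = child.get();
        }
        n->terminal = true;
    }

    bool contains(const std::string& utf8) const {
        const Node* n = &root_;
        for (unsigned char b : utf8) {
            auto it = n->next.find(b);
            if (it == n->next.end()) return false;  // empty branch
            n = it->second.get();
        }
        return n->terminal;
    }
};
```

After inserting "\xC3\xA9" (é), contains("\xC3\xA9") is true, while the lone lead byte "\xC3" and the neighbouring "\xC3\xA8" (è) are not members.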
 
Stephan Rose

On Mon, 03 Dec 2007 01:00:25 -0800, James Kanze wrote:

<snip>

Thanks to the both of you for your information. This is my first time
doing anything with unicode so this is all new territory for me. You've
helped a lot.

--
Stephan
2003 Yamaha R6

君の事思い出す日なんてないのは
君の事忘れたときがないから
 
