How to convert a narrow string to a wide string and vice versa?

thinktwice

I'm using the VC++6 IDE.
I know I could use macros like A2T and T2A,
but is there a more decent way to do this?
 
Bart

thinktwice said:
I'm using the VC++6 IDE.
I know I could use macros like A2T and T2A,
but is there a more decent way to do this?

Look up std::ctype::widen and std::ctype::narrow in the <locale>
header.
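
As a rough sketch of what that could look like with a standard-conforming library: the helper names `widen` and `narrow` below are made up for illustration, and the default-constructed `std::locale` (the global locale) is an assumption. Older compilers such as VC++6 may need their own spelling of `use_facet`.

```cpp
#include <locale>
#include <string>

// Widen each char with the locale's ctype<wchar_t> facet (hypothetical helper).
std::wstring widen(const std::string& s, const std::locale& loc = std::locale())
{
    std::wstring out(s.size(), L'\0');
    if (!s.empty())
        std::use_facet<std::ctype<wchar_t> >(loc)
            .widen(s.data(), s.data() + s.size(), &out[0]);
    return out;
}

// Narrow each wchar_t; characters with no narrow equivalent become `dflt`.
std::string narrow(const std::wstring& w, char dflt = '?',
                   const std::locale& loc = std::locale())
{
    std::string out(w.size(), '\0');
    if (!w.empty())
        std::use_facet<std::ctype<wchar_t> >(loc)
            .narrow(w.data(), w.data() + w.size(), dflt, &out[0]);
    return out;
}
```

Both calls work one character at a time, so they cannot handle multibyte or otherwise variable-width narrow encodings.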

Regards,
Bart.
 
Kirit Sælensminde

Bart said:
Look up std::ctype::widen and std::ctype::narrow in the <locale>
header.

These may not be much good for Unicode or other variable width
encodings - depends on how you use the resultant strings.

It's a tricky thing to deal with. If you properly understand what you
mean by 'narrow' and 'wide' strings the solution should present itself.
If you're not sure what the string content means then you're unlikely
to find the right solution in a library because you won't know how to
use the functions or their results properly.


K
 
Arne 'deice' Pajunen

Kirit said:
These may not be much good for Unicode or other variable width
encodings - depends on how you use the resultant strings.

It's a tricky thing to deal with. If you properly understand what you
mean by 'narrow' and 'wide' strings the solution should present itself.
If you're not sure what the string content means then you're unlikely
to find the right solution in a library because you won't know how to
use the functions or their results properly.

well, if you just want a quick ugly hack, then personally i've sometimes
used:

wstring wide(L"some wide character string");
string narrow(wide.begin(), wide.end());

But this is a cleaving axe for microsurgery: it copies each wchar_t value
straight into a char, so it depends on wide having the same code points as
the charset used in string. That is only really true if the wstring is
Unicode, contains only ISO-8859-1 characters (0-255), and the normal
character encoding is ISO-8859-1 or similar. (What the char type holds
depends on the platform.)

I would actually be interested in seeing what the "clean" solution for
converting is when you have, say, Unicode in wchar_t's and whatever
encoding the locale specifies in char's (ISO-8859-1, or maybe
windows-1252) :)
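
One candidate for that case is the C library pair `wcstombs`/`mbstowcs`, which convert between wchar_t and the multibyte encoding of the current C locale. The sketch below is only an illustration: the function names `to_locale_encoding` and `from_locale_encoding` are invented here, and throwing on failure is an assumed error policy.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Convert a wide string to the multibyte encoding of the current C locale.
// Throws if a character cannot be represented in that encoding.
std::string to_locale_encoding(const std::wstring& wide)
{
    // Worst case: MB_CUR_MAX bytes per wide character, plus the terminator.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in locale encoding");
    return std::string(&buf[0], n);
}

// Convert a string in the locale's multibyte encoding to a wide string.
std::wstring from_locale_encoding(const std::string& narrow)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    return std::wstring(&buf[0], n);
}
```

A program starts in the "C" locale, so it would need something like `std::setlocale(LC_ALL, "")` first to pick up the user's encoding. Both functions stop at the first embedded null character.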

//deice [deice at deice dot cjb dot net]
//Arne Pajunen
 
Kirit Sælensminde

Arne said:
well, if you just want a quick ugly hack, then personally i've sometimes
used:

wstring wide(L"some wide character string");
string narrow(wide.begin(), wide.end());

But this is a cleaving axe for microsurgery: it copies each wchar_t value
straight into a char, so it depends on wide having the same code points as
the charset used in string. That is only really true if the wstring is
Unicode, contains only ISO-8859-1 characters (0-255), and the normal
character encoding is ISO-8859-1 or similar. (What the char type holds
depends on the platform.)

I would actually be interested in seeing what the "clean" solution for
converting is when you have, say, Unicode in wchar_t's and whatever
encoding the locale specifies in char's (ISO-8859-1, or maybe
windows-1252) :)

The first step is to convert the UTF-16 (which is normal for wchar_t on
Windows; many other platforms/compilers use UTF-32, in which case this
step is a no-op) to UTF-32. Then convert that down to the target encoding
(often with a code table, but sometimes algorithmically). Of course
there's the open question of what to do with characters that don't/can't
map. In some applications you can use a variety of character encodings
(as distinct from character sets). For example, if you're using
ISO-8859-1 in XML/HTML you can use the numeric character references
(such as &#8364;) that XML/HTML defines for this.
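
A minimal sketch of those two steps, assuming a C++11 compiler (so not VC++6 itself) and using `std::u16string`/`std::u32string` in place of wchar_t. The function names are hypothetical, and replacing lone surrogates with U+FFFD is just one possible policy.

```cpp
#include <string>

// Decode UTF-16 code units into UTF-32 code points. Unpaired surrogates
// are replaced with U+FFFD (an assumption; other policies are possible).
std::u32string utf16_to_utf32(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD; // lone surrogate
        }
        out.push_back(u);
    }
    return out;
}

// Map code points to ISO-8859-1 bytes. ISO-8859-1 is the algorithmic case:
// U+0000..U+00FF map to the same byte value. Anything above U+00FF is
// written as a numeric character reference such as "&#8364;".
std::string utf32_to_latin1_xml(const std::u32string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] <= 0xFF)
            out.push_back(static_cast<char>(static_cast<unsigned char>(in[i])));
        else
            out += "&#" + std::to_string(static_cast<unsigned long>(in[i])) + ";";
    }
    return out;
}
```

This does not escape markup characters such as `&` and `<`; that would be a separate step. A target like windows-1252 would need a lookup table rather than the direct byte copy.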

A full answer depends on what you are using the string for, which is why
it's so hard to answer. For some things your solution is perfectly
valid - it's fine for the many parts of internet protocols which are
defined to use ASCII characters only.
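
For that ASCII-only case, the element-wise copy can be guarded so it refuses anything it would mangle. This is only a sketch; `is_ascii` and `narrow_ascii` are names invented for the example.

```cpp
#include <string>

// True if every wide character is in the 7-bit ASCII range.
bool is_ascii(const std::wstring& wide)
{
    for (std::wstring::size_type i = 0; i < wide.size(); ++i)
        // The cast makes negative values (where wchar_t is signed) fail too.
        if (static_cast<unsigned long>(wide[i]) > 0x7F)
            return false;
    return true;
}

// Narrow by element-wise copy, but only when that is known to be lossless.
// Returns false and leaves `out` untouched otherwise.
bool narrow_ascii(const std::wstring& wide, std::string& out)
{
    if (!is_ascii(wide))
        return false;
    out.assign(wide.begin(), wide.end());
    return true;
}
```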

For our framework we're looking at using ICU to do the conversions, but
haven't had much of a chance to play with it yet. As nearly 100% of our
interactions are through HTTP, we just use UTF-8 and that solves nearly
the whole problem. We have found it useful to define our own
std::wstring-like class that uses UTF-32 at the single-character
interface points (operator[] and at() etc.) but uses UTF-16 for
character sequences. Things like substr() use the correct position and
count based on the number of UTF-32 characters, _not_ the number of
UTF-16 code units, so applications can't chop some characters in half.
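
To illustrate the idea (this is not the class described above, just a free-function sketch with made-up names, assuming C++11 and `std::u16string` storage), a substr() that counts in code points could look like this:

```cpp
#include <string>

// Is `u` the first (high) or second (low) half of a surrogate pair?
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Index of the code unit that follows the code point starting at `i`.
std::size_t next_code_point(const std::u16string& s, std::size_t i)
{
    if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1]))
        return i + 2; // a surrogate pair occupies two code units
    return i + 1;
}

// substr() where `pos` and `count` are measured in code points, so a
// surrogate pair is never split.
std::u16string substr_code_points(const std::u16string& s,
                                  std::size_t pos, std::size_t count)
{
    std::size_t begin = 0;
    for (; pos > 0 && begin < s.size(); --pos)
        begin = next_code_point(s, begin);
    std::size_t end = begin;
    for (; count > 0 && end < s.size(); --count)
        end = next_code_point(s, end);
    return s.substr(begin, end - begin);
}
```

The scan is linear in the string length; a real class would probably cache positions or accept that cost.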


K
 
