Wide character to multi-byte

PEK · Jan 5, 2005

I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly on Windows and know how to
solve it there, but I would like to have some platform-independent
code too.

I have tried mbstowcs/wcstombs but I'm not satisfied with the
result. If wcstombs finds a character that can't be converted, it returns
-1 and stops. I would like to replace such characters with some
special character and convert as much as possible.
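
To illustrate the behaviour described above, here is a minimal sketch. The helper name WholeConversionFails is made up for this example, and it assumes the program is still in the default "C" locale, where U+20AC (the euro sign) has no multi-byte representation.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: returns true when std::wcstombs gives up on the string.
// Assumption: the default "C" locale is active, so U+20AC cannot be converted.
bool WholeConversionFails(const wchar_t* text)
{
    char buffer[64];
    // wcstombs reports (size_t)-1 as soon as it meets a wide character
    // that has no representation in the current locale.
    std::size_t written = std::wcstombs(buffer, text, sizeof buffer);
    return written == static_cast<std::size_t>(-1);
}
```

With these assumptions, WholeConversionFails(L"abc") should be false and WholeConversionFails(L"ab\u20ACcd") should be true, even though four of the five characters in the second string are plain ASCII.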

So I have written my own functions, based on mbtowc and wctomb. I have
successfully converted text to and from different code pages (I have
tried 437, 1252 and 949 [Korean, with some characters that take two
bytes]). So I think the code is OK, but I would appreciate it if someone
else looked at it (so I have someone to blame ;-).

The code:

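// Requires <cstdlib> (mbtowc, MB_CUR_MAX) and <string>.
void ConvertCharToWstring(const char* from, wstring &to)
{
    to.clear();

    //Reset mbtowc's internal shift state before starting
    mbtowc(NULL, NULL, 0);

    size_t pos = 0;
    wchar_t temp;

    while(true)
    {
        //mbtowc returns an int: bytes consumed, 0 at '\0', -1 on error
        int len = mbtowc(&temp, from+pos, MB_CUR_MAX);

        //Found end
        if(len == 0)
            return;
        else if(len < 0)
        {
            //Invalid byte sequence, this should never happen.
            //Reset the shift state and skip one byte.
            mbtowc(NULL, NULL, 0);
            pos++;
        }
        else
        {
            to += temp;
            pos += len;
        }
    }
}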

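// Requires <cstdlib> (wctomb, MB_CUR_MAX), <string> and <vector>.
void ConvertWcharToString
(const wchar_t* from, string &to,
bool* datalost, char unknownchar)
{
    to.clear();

    //Start from a known value so the caller can trust the flag
    if(datalost != NULL)
        *datalost = false;

    //MB_CUR_MAX is the largest number of bytes one character can need
    vector<char> temp(MB_CUR_MAX);

    //Reset wctomb's internal shift state before starting
    wctomb(NULL, 0);

    while(*from != L'\0')
    {
        //wctomb returns an int: bytes written, or -1 if not representable
        int len = wctomb(&temp[0], *from);

        if(len < 0)
        {
            //Replace with unknown character
            to += unknownchar;

            if(datalost != NULL)
                *datalost = true;
        }
        else
        {
            //Copy all bytes of the converted character
            to.append(&temp[0], len);
        }

        from++;
    }
}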

/PEK

Unforgiven · Jan 5, 2005

PEK said:
I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly on Windows and know how to
solve it there, but I would like to have some platform-independent
code too.
/PEK

// wide-char to multibyte:
wstring source = L"something";
typedef ctype<wchar_t> CT;
size_t length = source.length();
char *result = new char[length];
CT const& ct = use_facet<CT>(locale());
ct.narrow(source.data(), source.data() + source.size(), 'X', result);
string dest(result, length);
delete[] result;
return dest;

For the reverse, use ct.widen instead (and make source a string and dest a
wstring, of course).
This uses the global C++ locale, which at program startup is the classic "C"
locale, *not* the system locale. To set a specific locale, use:
locale::global(locale("Dutch_Netherlands"));
At least on Windows with VC, this sets the global locale to the system
locale:
locale::global(locale(""));

Note that this won't handle actual multi-byte character sets, i.e. character
sets with more than 256 characters (e.g. JIS); those characters will not get
converted properly. I know of no standard way to handle those, just the
Windows WideCharToMultiByte function.
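
The reverse direction mentioned above could look roughly like this. It is only a sketch: the function name WidenWithCtype and the optional loc parameter are made up for illustration, and like narrow it maps exactly one char to one wchar_t, so it has the same limitation with real multi-byte character sets.

```cpp
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: widen a narrow string with the ctype facet.
// Each char becomes exactly one wchar_t, so multi-byte sequences are
// NOT decoded; this is only suitable for single-byte character sets.
std::wstring WidenWithCtype(const std::string& source,
                            const std::locale& loc = std::locale())
{
    typedef std::ctype<wchar_t> CT;
    const CT& ct = std::use_facet<CT>(loc);

    std::vector<wchar_t> buffer(source.size());
    if (!source.empty())
    {
        // widen(low, high, to) converts [low, high) into the buffer at 'to'
        ct.widen(source.data(), source.data() + source.size(), &buffer[0]);
    }
    return std::wstring(buffer.begin(), buffer.end());
}
```

Using a vector instead of new[]/delete[] means the buffer is released even if something throws along the way.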

Jonathan Turkanis · Jan 5, 2005

PEK said:
I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly on Windows and know how to
solve it there, but I would like to have some platform-independent
code too.

The standard C++ solution is to use codecvt facets. Currently these are a bit
hard to use, but there is a proposal to add several components which would make
it easier. See

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html.

In the meantime, both the Boost Serialization library and the soon-to-be-released
Boost Iostreams

http://home.comcast.net/~jturkanis/iostreams/libs/iostreams/doc/?path=5.6

library contain code conversion components. (The documentation for the iostreams
code conversion component is temporarily out-of-sync with the source.)

You can also use the Dinkumware CoreX library, which is reasonably priced and is
the basis for n1683.
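
As a rough idea of what using a codecvt facet directly involves, here is a minimal sketch for the wide to multi-byte direction. The function name NarrowWithCodecvt and its parameters are invented for this example, and it converts one wide character per call so that a failure can be replaced and flagged; it is not taken from any of the libraries mentioned above.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: wide -> multi-byte through the locale's codecvt facet.
// Characters the locale cannot represent are replaced by 'unknownchar'.
std::string NarrowWithCodecvt(const std::wstring& source, char unknownchar,
                              bool* datalost,
                              const std::locale& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CVT;
    const CVT& cvt = std::use_facet<CVT>(loc);

    if (datalost != NULL)
        *datalost = false;

    std::string narrow;
    std::mbstate_t state = std::mbstate_t();
    // max_length() is the most bytes a single wide character can produce
    std::vector<char> buffer(static_cast<std::size_t>(cvt.max_length()) + 1);

    for (std::size_t i = 0; i < source.size(); ++i)
    {
        const wchar_t* from = &source[i];
        const wchar_t* from_next = from;
        char* to_next = &buffer[0];

        CVT::result r = cvt.out(state, from, from + 1, from_next,
                                &buffer[0], &buffer[0] + buffer.size(), to_next);

        if (r == CVT::ok && from_next == from + 1)
        {
            // Append the bytes produced for this character
            narrow.append(&buffer[0],
                          static_cast<std::size_t>(to_next - &buffer[0]));
        }
        else
        {
            // Not representable: substitute, flag it, and start from a clean state
            narrow += unknownchar;
            state = std::mbstate_t();
            if (datalost != NULL)
                *datalost = true;
        }
    }
    return narrow;
}
```

In the default "C" locale, NarrowWithCodecvt(L"a\u20ACb", '?', &lost) should yield "a?b" with lost set to true. Converting one character per call is slower than handing the facet the whole string, but it keeps the error handling simple.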

Jonathan

Jonathan Turkanis · Jan 5, 2005

Note that this won't handle actual multi-byte character sets, i.e.
character sets with more than 256 characters (e.g. JIS); those characters
will not get converted properly. I know of no standard way to handle
those, just the Windows WideCharToMultiByte function.

Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters. The
preferred C++ solution is to use a codecvt facet instead of a ctype facet.
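
For the multi-byte to wide direction, a codecvt-based conversion might be sketched as follows. The name WidenWithCodecvt and the unknownchar parameter are assumptions made for this example; unlike ctype::widen, the facet's in() member consumes as many bytes as one character needs.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: multi-byte -> wide through the locale's codecvt facet.
// Bytes that do not form a valid character are replaced by 'unknownchar'.
std::wstring WidenWithCodecvt(const std::string& source, wchar_t unknownchar,
                              const std::locale& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CVT;
    const CVT& cvt = std::use_facet<CVT>(loc);

    std::wstring wide;
    std::mbstate_t state = std::mbstate_t();
    const char* from = source.data();
    const char* from_end = from + source.size();

    while (from != from_end)
    {
        wchar_t one;                 // room for exactly one output character
        wchar_t* to_next = &one;
        const char* from_next = from;

        cvt.in(state, from, from_end, from_next, &one, &one + 1, to_next);

        if (to_next != &one)
        {
            // One wide character was produced; continue after its bytes
            wide += one;
            from = from_next;
        }
        else
        {
            // Invalid or truncated sequence: substitute and skip one byte
            wide += unknownchar;
            state = std::mbstate_t();
            ++from;
        }
    }
    return wide;
}
```

Which byte sequences count as valid depends entirely on the locale passed in, so the replacement branch may or may not trigger for a given input on a given platform.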

Jonathan

Unforgiven · Jan 5, 2005

Jonathan Turkanis said:
Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters.

That I knew, but it has the drawback of stopping at unrecognized characters
instead of replacing them with some predetermined character (like '?'), as
the OP mentioned.

The preferred C++ solution is to use a codecvt facet instead of a ctype facet.

That I didn't know.

PEK · Jan 6, 2005

That I knew, but it has the drawback of stopping at unrecognized characters
instead of replacing them with some predetermined character (like '?'), as
the OP mentioned.

A workaround for this is to use mbtowc/wctomb instead and convert the
characters one at a time in a loop. This was my solution and it seems to
work, or are there any problems with it?

That I didn't know.

The code Unforgiven posted is a bit obscure, but I think I understand most
of it. But I also want to detect whether an unrecognized character was
replaced (I guess I didn't mention that in my earlier post). Another
problem with the code is that I suppose it's hard to calculate the
length of the result when multi-byte characters are used.
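
On the length question, one possible approach is to ask wcsrtombs for the size without writing anything. This is only a sketch; the helper name MultiByteLength is made up, and it relies on the C library rule that a null destination makes wcsrtombs count the bytes instead of storing them.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper: number of bytes (excluding the terminating '\0')
// needed to store 'text' in the current locale's multi-byte encoding.
// Returns (size_t)-1 if some character cannot be converted.
std::size_t MultiByteLength(const wchar_t* text)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* cursor = text;
    // With a null destination nothing is written and the length
    // argument is ignored; only the required size is computed.
    return std::wcsrtombs(NULL, &cursor, 0, &state);
}
```

A simpler alternative is to allocate an upper bound of (number of wide characters) * MB_CUR_MAX bytes and trim afterwards, which also works when unconvertible characters are replaced rather than rejected.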

/PEK
