Converting from std::wstring to UTF-8 std::string

Peter Poulsen

Hi

I'm trying to convert a std::wstring to std::string UTF-8 encoded. I
have made a simple function that does the trick for three letters but
it can hardly be considered a generic solution.

<code>
#include <string>

std::string wchar2utf8(std::wstring const& wstr)
{
    std::string str;
    std::wstring::const_iterator itr = wstr.begin();
    while (itr != wstr.end()) {
        switch (*itr) {
        case 0x00f8: // ø
            str.push_back('\xc3');
            str.push_back('\xb8');
            break;
        case 0x00e5: // å
            str.push_back('\xc3');
            str.push_back('\xa5');
            break;
        case 0x00e6: // æ
            str.push_back('\xc3');
            str.push_back('\xa6');
            break;
        default:
            // Only correct for ASCII (values below 0x80).
            str.push_back(static_cast<char>(*itr));
            break;
        }
        ++itr;
    }

    return str;
}
</code>

I have tried Google, but it is extremely hard to find a good
explanation of how to do it. Can somebody give an example of a
generic solution for my wchar2utf8() function?

Yours
/peter
 
Victor Bazarov

I'm trying to convert a std::wstring to std::string UTF-8 encoded. I
have made a simple function that does the trick for three letters but
[...]

Isn't there a library that you could use? I can't imagine that nobody
before you has thought of the need to do that...

V
 
Nobody

Is there somebody that can give an example of a generic solution for my
wchar2utf8() function?

The following function converts a single Unicode code point to UTF-8. Note
that a wchar_t may or may not hold a complete Unicode code point
(particularly on Windows, which can't make up its mind whether wide strings
are UCS-2 or UTF-16).

#include <string>

// Appends the UTF-8 encoding of code point c to str.
void unicode2utf(std::string& str, unsigned int c) {
// Emits one byte: the marker bits in flags plus up to six payload bits of c.
#define ADD(str, c, flags, shift) \
    str.push_back((char) ((flags) | (((c) >> (shift)) & 0x3F)))

    if (c < 0x80U)
        str.push_back((char) c);
    else if (c < 0x800U) {
        ADD(str, c, 0xC0, 6);
        ADD(str, c, 0x80, 0);
    }
    else if (c < 0x10000U) {
        ADD(str, c, 0xE0, 12);
        ADD(str, c, 0x80, 6);
        ADD(str, c, 0x80, 0);
    }
    else if (c < 0x200000U) {
        ADD(str, c, 0xF0, 18);
        ADD(str, c, 0x80, 12);
        ADD(str, c, 0x80, 6);
        ADD(str, c, 0x80, 0);
    }
    // The 5- and 6-byte forms below come from the original UTF-8 definition.
    // RFC 3629 restricts UTF-8 to U+0000..U+10FFFF, so valid code points
    // never reach these branches.
    else if (c < 0x4000000U) {
        ADD(str, c, 0xF8, 24);
        ADD(str, c, 0x80, 18);
        ADD(str, c, 0x80, 12);
        ADD(str, c, 0x80, 6);
        ADD(str, c, 0x80, 0);
    }
    else if (c < 0x80000000U) {
        ADD(str, c, 0xFC, 30);
        ADD(str, c, 0x80, 24);
        ADD(str, c, 0x80, 18);
        ADD(str, c, 0x80, 12);
        ADD(str, c, 0x80, 6);
        ADD(str, c, 0x80, 0);
    }
#undef ADD
}
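
To get from a per-character encoder to the wchar2utf8() that was asked for, you just loop over the string. Here is a rough sketch of how that might look. It assumes every wchar_t already holds one complete code point (true when wchar_t is 32 bits and the string is UTF-32, not true for UTF-16 with surrogate pairs). The helper name append_utf8 is made up for this example, and replacing invalid values with U+FFFD is my own choice, not something the standard library does for you.

```cpp
#include <string>

// Hypothetical helper: append one code point as UTF-8 (1 to 4 bytes).
// Assumption: surrogates and values above 0x10FFFF are replaced by U+FFFD.
void append_utf8(std::string& out, unsigned long c) {
    if ((c >= 0xD800UL && c <= 0xDFFFUL) || c > 0x10FFFFUL)
        c = 0xFFFDUL;
    if (c < 0x80UL) {
        out.push_back(static_cast<char>(c));
    } else if (c < 0x800UL) {
        out.push_back(static_cast<char>(0xC0 | (c >> 6)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else if (c < 0x10000UL) {
        out.push_back(static_cast<char>(0xE0 | (c >> 12)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (c >> 18)));
        out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
}

// Assumption: each wchar_t is a whole code point (no surrogate pairs).
std::string wchar2utf8(std::wstring const& wstr) {
    std::string str;
    for (std::wstring::const_iterator it = wstr.begin(); it != wstr.end(); ++it)
        append_utf8(str, static_cast<unsigned long>(*it));
    return str;
}
```

With this, a wstring containing the single character 0x00f8 (ø) comes out as the two bytes 0xc3 0xb8, the same as the hand-written switch in the original question.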
 
Joshua Maurice

First you have to know/decide what is stored in std::wstring. Is it
UTF-16 or UCS-2? UTF-16 needs to be first decoded.

Note that wchar_t is 32 bits on most unix-like operating systems, and
by convention wstrings store UTF-32 strings. Windows is the oddball
desktop/server with 16 bit wchar_t.
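
For what it's worth, the decoding step could be sketched roughly like this. It assumes C++11 (for char32_t and std::u32string), treats 16-bit units as UTF-16 and anything wider as already being UTF-32. The name to_code_points is made up for the example, and passing unpaired surrogates through unchanged is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn a string of 16-bit or 32-bit units into
// Unicode code points.
// Assumption: 16-bit units are UTF-16, wider units are already UTF-32.
template <typename String>
std::u32string to_code_points(String const& in) {
    std::u32string out;
    bool const utf16 = sizeof(typename String::value_type) == 2;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = static_cast<char32_t>(in[i]);
        // A high surrogate followed by a low surrogate encodes one
        // code point above U+FFFF.
        if (utf16 && c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()) {
            char32_t const low = static_cast<char32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);  // unpaired surrogates pass through unchanged
    }
    return out;
}
```

The same template accepts a std::wstring on either kind of platform, or a std::u16string, and the resulting code points can then be fed one at a time to a UTF-8 encoder.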
 
Marc

Joshua said:
Note that wchar_t is 32 bits on most unix-like operating systems, and
by convention wstrings store UTF-32 strings. Windows is the oddball
desktop/server with 16 bit wchar_t.

Yes, good thing we have char32_t now (er, soon at least).
 
