String encoding in C++

Allen

Hi, I am porting a C++ program from Win32 to IBM AIX 5.3.
There is a file named Measurement.cpp which contains some strings, for
example:

static std::wstring breaker = L"开关";

Measurement.cpp is encoded in UTF-8.
The porting procedure is as follows:
1. Change the encoding of Measurement.cpp to GB18030, as AIX 5.3
requires.
2. Write a helper function named ws2s:
std::string ws2s(const std::wstring & src) {
   const int dsize = 2 * src.size() + 1;
   char * buff = new char[dsize];
   memset(buff, 0, dsize);
   setlocale(LC_ALL, "");
   wcstombs(buff, src.c_str(), dsize);
   setlocale(LC_ALL, "C");
   std::string result = buff;
   delete[] buff;
   buff = NULL;
   return result;
}
3. Output the breaker string:
std::cout << ws2s(breaker) << std::endl;

But the output text is not displayed correctly.

Would you please help me? Thank you.
 
Francesco S. Carta

Try writing the output to a text file and opening it with a non-console
editor. If the characters appear correct in that editor, you may have to
change the encoding and font of your console.

The problem could also lie in the way you changed the encoding of
that file. Does it look fine when you reopen it?

I didn't test your program, just for the record, and this issue
doesn't seem to be C++ specific. You might consider posting it to
a platform-specific group to increase your chances of getting
sensible help.
 
Allen

Replying to the message of 01/07/2010 09:25:05:

Thank you for your reply.
Neither the console output nor the file output is correct.
I noticed that it looks fine when I open the file in vi on the console.

Allen
 
Allen

A previous reply suggested three possible reasons:

1) When you compile Measurement.cpp, your C++ compiler must be aware that
this module uses GB18030. Check your compiler's documentation.

2) At runtime, your locale does not match the encoding used by your display
terminal.

3) Your C++ library does not implement the encoding used by your locale.

You can find out the answer yourself by printing the contents of your
std::wstring first, as numerical wchar_t values, and verifying their Unicode
values, presuming that your C++ library puts UTF-16 or UTF-32 into your
wchar_t's; and by printing the contents of your converted string buffer, as
numerical chars, and verifying that their encoding is correct.
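The check described above could be sketched roughly as follows. The helper names dump_wide and dump_narrow are made up for this illustration and are not part of the original program. If the library stores Unicode in wchar_t, the two characters of the breaker literal should come out as 5F00 5173.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: hex value of every wchar_t in the string,
// separated by spaces, e.g. "5F00 5173".
std::string dump_wide(const std::wstring& s) {
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        if (i) out << ' ';
        out << std::setw(4) << static_cast<unsigned long>(s[i]);
    }
    return out.str();
}

// Hypothetical helper: hex value of every byte in the narrow string.
// The cast through unsigned char avoids sign extension of bytes >= 0x80.
std::string dump_narrow(const std::string& s) {
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (i) out << ' ';
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}
```

Printing dump_wide(breaker) and dump_narrow(ws2s(breaker)) side by side shows whether the literal or the conversion is the step that goes wrong.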

Thank you for the detailed answer.
It is strange that the part of the string read from the XML file by
Xerces-C is displayed correctly,
while the constant breaker part is not.
To illustrate, I wrote the following example code:
std::wstring prefix = xercesc-c...getAttributeText(...);
std::wstring breaker = L"开关";
std::wstring name = prefix + breaker;
I write name to a file; the prefix part is correct, but the breaker part
is not.

So there are two things I don't understand.
1. How does the source file encoding relate to a constant string, i.e. L"开关"?
2. What encoding does std::wstring use?

Thank you again.
Allen
 
Jorgen Grahn

I haven't read the rest, but those setlocale() calls are a bit scary:
they affect global state, they don't restore the original state
(unless it was "C"), and I have a feeling they may be expensive (but
profile them yourself).

If you need to mess with locales, there's a better interface in C++
which may help. The chapter of "The C++ Programming Language" which
describes it is freely downloadable:

http://www2.research.att.com/~bs/3rd_loc0.html
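As a rough sketch of what that interface looks like, the conversion can go through the codecvt facet of a std::locale object, which leaves the global C locale alone. The function name narrow is invented for this example. Whether the locale you pass in actually converts to GB18030 depends on what is installed on the machine, so treat this as an illustration under that assumption rather than a drop-in fix.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical replacement for ws2s: converts with the codecvt facet of
// the locale passed in, without calling setlocale().
std::string narrow(const std::wstring& src, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(loc);

    if (src.empty()) return std::string();

    // max_length() is the most bytes a single wide character can need,
    // so the buffer is large enough for any input of this length.
    std::vector<char> buf(src.size() * cvt.max_length() + 1);
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from_next = 0;
    char* to_next = 0;

    std::codecvt_base::result r = cvt.out(
        state,
        src.data(), src.data() + src.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);

    if (r == std::codecvt_base::error || r == std::codecvt_base::partial)
        throw std::runtime_error("string not representable in this locale");

    return std::string(&buf[0], to_next);
}
```

A call such as narrow(breaker, std::locale("")) would use the locale named by the environment. Note that std::locale("") itself throws std::runtime_error if the environment names a locale that is not installed.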

/Jorgen
 
James Kanze

3) Your C++ library does not implement the encoding used by
your locale.

How can you set a locale which isn't installed?

Generally, XML parsers expect an XML document to use UTF-8. If an
XML document uses a different encoding, it would specify it in
the <?xml … > processing instruction.

I'd also be curious as to what characters are involved. It's
very frequent to have XML files which don't contain Chinese
characters, or only contain them in CDATA sections. If the
characters he's displaying from Xerces correspond to ASCII, then
it's not surprising that they display correctly.

In order to begin analysing such a problem, it's necessary to
know 1) what should be output, and 2) what actually is output.
In both cases, the actual numerical values of the bytes, not
what is being displayed by some display engine.

This is implementation defined. Most C++ libraries use UTF-16
or UTF-32.

I think you're being overoptimistic about Unicode use: the
last time I had access to non-Windows machines, Solaris (and Sun
CC) still didn't use Unicode.

Still, I think AIX is UTF-16. And I'm pretty sure that the
compiler doesn't use GB 18030 by default.

(As a general rule, I'd recommend using a Unicode format
internally regardless, and translating to GB 18030, if
necessary, on input and output.)

Another factor is what your compiler thinks is the character
coding of the C++ source.

The same one.

The same one as what? In practice, std::wstring can probably
handle any encoding which will fit; the compiler ignores the
encoding, except for wide character literals; and the library
will use whatever encoding is specified by the locale it uses in
a given function, which isn't necessarily the same as the one
the compiler used when it interpreted the wide character
literals.
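One way to take the source file encoding out of the picture, sketched here as a suggestion that was not made in the thread, is to spell the literal with universal character names. U+5F00 and U+5173 are the Unicode code points of the two characters in the original literal. This assumes the compiler accepts \u escapes (some need an option for it) and maps them to the expected wchar_t values, which is implementation defined.

```cpp
#include <string>

// Same two characters as the original breaker literal, written as
// universal character names so the file itself stays pure ASCII.
static const std::wstring breaker = L"\u5F00\u5173";
```

With the literal written this way, converting the source file between UTF-8 and GB18030 no longer changes what the compiler sees.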
Again: you will find the answer to your questions by printing
out the numerical values of your wide and narrow character
strings, using a test program, instead of guessing as to
what's going on.

I'd use two steps: print out the numerical values in the
program, and dump the numerical values of the bytes in the file.
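For the second step, a small helper along these lines could dump the bytes of the output file. The name dump_file_bytes is hypothetical. On most Unix systems a tool such as od does the same job without any code.

```cpp
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: returns the bytes of a file as space-separated
// hex values, reading in binary mode so nothing is translated.
std::string dump_file_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    char c;
    bool first = true;
    while (in.get(c)) {
        if (!first) out << ' ';
        first = false;
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(c));
    }
    return out.str();
}
```

Comparing this dump with the byte values printed inside the program shows whether anything changes the data between the conversion and the file.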
 
