C++, wchar_t, Unicode and all that stuff

gamehack

Hi all,

I was doing a bit of research about writing yet another build tool, but
that's not the point of my mail. I'm going to ask a few questions about
how to resolve some internationalization problems, and I'm sorry if this
is not the right mailing list; I couldn't find any other that was
suitable (since my goal is to solve the problems in a
platform-independent way). The goal is to be able to deal with
different encodings on different platforms in a portable fashion.
After reading a few articles on the net I realized
that everything boils down to the character size. The problem splits
into how you manage the chars/strings internally and externally.

Internally (the way they are put in the source code files and what types
they are stored in):

Using wchar_t:
Basically, using wchar_t as the fundamental character type (AFAIK it is
2-4 bytes depending on the platform) and using all the corresponding w
functions and streams. The problem is what to do if there is no OS
function that accepts wchar_t. Then I would need to write my own
library to handle the proper conversions (I'm not sure simple type casts
would do the job). Also, wchar_t is not specified to use any particular
encoding, so I'm a bit confused about that. If I write in a source file
wchar_t* st = "something"; what encoding would it be stored as? And what
about wchar_t* st = L"something"; UTF-8?

Using UTF-8:
I've not seen any articles on how to do this (except suggestions to use
long unsigned to store the chars, but what about conversions and passing
strings to OS functions?)

Externally (OS interfaces):
I have no idea at all how to handle this. When you write e.g.
main(int argc, char** argv), what happens if the arguments are passed as
UTF-8 strings? How do you handle that? How do you handle conversion to
and from the internal representation (by writing your own library?) Is
there actually a portable way of doing it?

I'm sorry if this is not the right place to ask these questions, but I'm
completely puzzled and thought you guys would be able to point me in
the right direction. As I said, the only thing I need is to be able
to communicate with the OS in a transparent manner without worrying
about the encoding, and to be able to use the future program in
completely UTF-8 environments, so that any valid UTF-8 could be passed,
etc. Any comments/directions/remarks are greatly appreciated.

Regards,
gamehack
 
Guest

gamehack said:
If I write in a source file
wchar_t* st = "something"; what encoding would it be stored as? And what
about wchar_t* st = L"something"; UTF-8?

Let me quote a post by Ulrich Eckhardt (from
microsoft.public.windowsce.embedded.vc). Here is the complete thread,
so you can get a better overview of the problem I asked about
(http://tinyurl.com/dbhyj):

"It is invalid C or C++ to embed these characters*** into sourcecode.
You are relying on compiler-specific support.
That said, there is a #pragma to tell MSC which codepage you're using."

*** - here Ulrich is talking about the Polish characters I embedded in my code.
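
To illustrate Ulrich's point: one way to avoid depending on how the compiler reads the source file is to spell non-ASCII characters as universal character names (\uXXXX), which are part of standard C++. Below is a minimal sketch. The function name polish_sample and the sample word are made up for illustration and do not come from the thread I linked.

```cpp
#include <string>

// Hypothetical helper (not from the thread): returns a four-letter Polish
// word written with ASCII source characters only. The non-ASCII letters are
// given as universal character names, so the result does not depend on the
// encoding or code page of the source file.
std::wstring polish_sample()
{
    // \u017C = z with dot above, \u00F3 = o with acute, \u0142 = l with stroke
    return L"\u017C\u00F3\u0142w";
}
```

All three escaped letters are in the Basic Multilingual Plane, so each one fits in a single wchar_t whether wchar_t is 2 or 4 bytes on your platform.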

gamehack said:
Using UTF-8:
I've not seen any articles on how do this(except suggestions to use
long unsigned to store the chars but what about conversions and passing
strings to OS functions?)

"Chapter 2 - An Introduction to Unicode" from the following book may be
helpful: http://www.charlespetzold.com/pw5/index.html
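
On the UTF-8 question itself, here is a rough sketch of one possible approach, not something taken from the book: keep the UTF-8 bytes in a plain std::string, pass them unchanged to char-based functions, and look at the individual bytes only when you need per-character information. The helper name utf8_length is hypothetical, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumption: the input is valid UTF-8; no validation is performed.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        // Continuation bytes have the bit pattern 10xxxxxx; every other
        // byte starts a new code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

So s.size() gives the number of bytes and utf8_length(s) the number of code points; the two only agree for pure ASCII text.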

Finally, I have seen many posts on Usenet about how to handle
Unicode/non-Unicode in the same program, etc., and what I can say is that
there seems to be no single best solution.
I mainly develop for the Windows CE platform and try to follow the
suggestions Charles Petzold presents in his book, and it works (but I don't
know if it would work on Unix, because on Unix I hardly ever use Unicode).

Cheers
 
Axter

gamehack said:
Externally(OS interfaces):
I've completely no idea how to handle this. When you write e.g.
main(int argc, char** argv) what happens if they pass the arguments as
UTF-8 strings? How do you handle that? How do you handle conversion
back/from the internal representation(writing your own library?) Is
there actually a portable way of doing it?

Both the C and C++ standard libraries have portable functions for
converting between multibyte ("ANSI") and wide character strings.
Check your man pages for wcstombs and mbstowcs. Note that both convert
according to the current locale.
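Example code (assuming FileWithWideChar and FileWithAnsiChar hold the two file names):
#include <cstdlib>   // wcstombs, MB_CUR_MAX
#include <fstream>
#include <string>
using namespace std;

wifstream wide_file(FileWithWideChar);
wstring TmpLineData;
string CmpFileData_InAnsi, AnsiTmpLine;
while (getline(wide_file, TmpLineData))
{
    // A wide character can take up to MB_CUR_MAX bytes once converted.
    AnsiTmpLine.resize(TmpLineData.size() * MB_CUR_MAX);
    // wcstombs wants raw pointers, not iterators.
    size_t n = wcstombs(&AnsiTmpLine[0], TmpLineData.c_str(), AnsiTmpLine.size());
    if (n == (size_t)-1)
        continue;   // line has a character the current locale cannot represent
    AnsiTmpLine.resize(n);   // keep only the bytes actually written
    CmpFileData_InAnsi += AnsiTmpLine + "\n";
}

ofstream ansi_file(FileWithAnsiChar);
ansi_file.write(CmpFileData_InAnsi.c_str(), CmpFileData_InAnsi.size());
ansi_file << endl;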
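
If you need the conversion in more than one place, the two calls can be wrapped in a pair of small functions. This is only a sketch: the names to_wide and to_narrow are hypothetical, the conversion follows whatever locale is currently set (the program would normally call setlocale(LC_ALL, "") once at startup to pick up the user's encoding), and input is only converted up to the first embedded null character.

```cpp
#include <cstddef>
#include <cstdlib>    // std::mbstowcs, std::wcstombs, MB_CUR_MAX
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: multibyte (char) string -> wide string,
// interpreted according to the current C locale (LC_CTYPE).
std::wstring to_wide(const std::string& narrow)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    return std::wstring(&buf[0], n);
}

// Hypothetical helper: wide string -> multibyte (char) string,
// encoded according to the current C locale (LC_CTYPE).
std::string to_narrow(const std::wstring& wide)
{
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");
    return std::string(&buf[0], n);
}
```

Both functions check for the (size_t)-1 error return, which is what you get when the text does not fit the current locale's encoding.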
 
