unicode (UCS-2 encoded)

wael · Aug 22, 2003

Hello all,
I want to convert a wchar_t to its UCS-2 encoding (for example, 0041 is a character encoded in UCS-2).
Please look at these:
http://www.unicode.org/charts/
http://www.unicode.org/

Every language has a chart.
For example:
char 'A' = 0041 (UCS-2 encoded)

A character from another language = 0628 (this one is in a different language).
I hope you understand what I mean.

Jason · Aug 22, 2003

Do you mean that you want to convert locale-specific strings (ASCII,
UTF-8, Big5, etc.) into Unicode UCS-2 two-byte entities, and then store them in
a wchar_t?

When porting a web browser, we had a type tChar16 to put Unicode into; it's
worthwhile using typedefs anyway, even though C++ has wchar_t. We wrote our
own language conversion libraries; it depends what you want to do. It's
probably not so much of a C++ question. Look up internationalization on the
web, in relation to C++.

wael · Aug 24, 2003

I have a wchar_t character.
Suppose it is 16 bits, like this:

wchar_t ch = L'A';
// the character A encoded in UCS-2 is 0041
How do I get ch as 2 UCS-2 encoded bytes?

Some data is stored in a database as UCS-2:
0041 0065 etc ... 0628 0627 ....
Thank you for taking the time to read this.

John Harrison · Aug 24, 2003

Like this?

wchar_t ch = L'A';
unsigned char bytes[2];    // unsigned, so values above 0x7F stay positive
bytes[0] = (ch >> 8) & 0xFF; // high byte: bytes[0] == 0x00
bytes[1] = ch & 0xFF;        // low byte:  bytes[1] == 0x41

Still not completely clear what you are trying to do.

john
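
The same split can be applied to a whole string. What follows is only a minimal sketch, assuming every character fits in 16 bits and that the bytes are wanted high byte first, as in the 0041 / 0628 notation used in this thread. The name to_ucs2_be is hypothetical, not part of any library.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: serialise each wide character as two bytes,
// high byte first (big-endian). Assumes every character fits in 16 bits;
// anything above 0xFFFF is truncated by the mask.
std::vector<unsigned char> to_ucs2_be(const std::wstring& text)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(text.size() * 2);
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        const unsigned long code = static_cast<unsigned long>(text[i]) & 0xFFFFu;
        bytes.push_back(static_cast<unsigned char>(code >> 8));   // prefix byte, e.g. 0x06
        bytes.push_back(static_cast<unsigned char>(code & 0xFF)); // low byte, e.g. 0x28
    }
    return bytes;
}
```

For a string holding 'A' followed by U+0628, the result is the four bytes 00 41 06 28.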

Jason · Aug 24, 2003

I am afraid I don't understand the question either. Perhaps you can tell us
what data you're reading, whether it's in Unicode or ASCII, and what you want to
do with it, instead of talking about wchar_t's and UCS-2 straight away.

John Harrison · Aug 24, 2003

wael said:
Thank you for the help.
Let me be more clear:
1- I receive text from an ASCII socket like this (I will assume it is on
{0041 , 0628 , 0627 , 0042}
Each group of 4 characters is a UCS-2 code in hex, as defined in
http://www.unicode.org/charts/
00 is the prefix for English, 41 is the character as defined in the chart
http://www.unicode.org/charts/PDF/U0000.pdf
06 is the prefix for Arabic, 28 is the character as defined in the chart
http://www.unicode.org/charts/PDF/U0600.pdf
I also receive data for other languages, like Greek.

I want to convert the incoming string to wchar_t, and from wchar_t back again
to send it by socket.

OK, try again, perhaps like this?

ascii_data is your string of Unicode numbers separated by commas, with a
leading { and trailing }, i.e. what you read from the socket. At the end,
unicode_data is a wide string of Unicode characters; you can write that to
your other socket.

#include <algorithm>
#include <istream>
#include <sstream>
#include <string>

int main()
{
    std::string ascii_data = "{0041 , 0628 , 0627 , 0042}";
    // remove leading and trailing {}
    ascii_data.erase(ascii_data.begin());
    ascii_data.erase(ascii_data.end() - 1);
    // replace commas with spaces
    std::replace(ascii_data.begin(), ascii_data.end(), ',', ' ');
    // use string as stream
    std::istringstream str(ascii_data);
    // read hex numbers from stream
    std::wstring unicode_data;
    unsigned char_value;
    while (str >> std::hex >> char_value)
    {
        unicode_data.push_back(static_cast<wchar_t>(char_value));
    }
}

john
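
The question also asks for the return trip, from wchar_t back to text for the socket. A rough sketch of that direction, assuming the outgoing message should use the same brace-and-comma layout as the incoming example; the function name to_hex_list is made up for illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats each wide character as four uppercase hex
// digits, separated by " , " and wrapped in braces. The layout is copied
// from the example message in this thread, which is an assumption about
// what the receiving side expects.
std::string to_hex_list(const std::wstring& text)
{
    std::ostringstream out;
    out << '{' << std::hex << std::uppercase << std::setfill('0');
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        if (i != 0)
            out << " , ";
        // setw applies only to the next value, so it is repeated each time
        out << std::setw(4) << (static_cast<unsigned long>(text[i]) & 0xFFFFu);
    }
    out << '}';
    return out.str();
}
```

Feeding it the unicode_data string produced by the parser above should give back {0041 , 0628 , 0627 , 0042}.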

wael · Aug 25, 2003

Thank you for your help.
The code you sent me is cool, but sorry, it works for English only.
Look at this code:
// this is what Notepad does
BYTE aa[2];
// header for the text file: it tells notepad.exe that the file type is Unicode
aa[0] = 0xFE;
aa[1] = 0xFF;
// end header
// My problem is here: how to encode 06 28 into a wchar_t,
// or how Notepad reads this file.
// This code is valid, and if you have the Arabic language installed on Win2000 or NT
// you will see it displayed fine.
// The hex value is not like English:
// hex('A') = 41, but it is not the same for other languages.
// I mean hex '\x28' is not what I get if I take hex(ASCII code of the char),
// and this is the real problem.
BYTE bb[4]; // 4 bytes = 2 Unicode characters
bb[0] = '\x06'; // Unicode hex encoding prefix
bb[1] = '\x28'; // Unicode hex encoding
bb[2] = '\x06'; // Unicode hex encoding prefix
bb[3] = '\x27'; // Unicode hex encoding

FILE *stream;
stream = fopen( "c:\\fprintf.txt", "wb" ); // binary mode, so no byte is translated
fwrite( aa, sizeof(BYTE), 2, stream ); // write header
fwrite( bb, sizeof(BYTE), 4, stream ); // write data
fclose( stream ); // close stream

Thank you for taking the time to read this.
wael ahmed

John Harrison · Aug 25, 2003

You have got your bytes in the wrong order! In Windows the prefix is the
second byte, and the header has to be written in the same order as the data.

aa[0] = 0xFF; // little-endian header: FF FE
aa[1] = 0xFE;
fwrite( aa, sizeof(BYTE), 2, stream ); // write header
bb[0] = '\x28'; // Unicode hex encoding
bb[1] = '\x06'; // Unicode hex encoding prefix
bb[2] = '\x27'; // Unicode hex encoding
bb[3] = '\x06'; // Unicode hex encoding prefix
fwrite( bb, sizeof(BYTE), 4, stream ); // write data

Apart from that I don't know what to suggest; I still don't understand what
you are trying to do.

john
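
One detail worth checking when writing such a file: the two header bytes are the byte-order mark, and they have to be in the same order as the data that follows (FE FF before high-byte-first data, FF FE before low-byte-first data). A minimal sketch of the little-endian case, assuming 16-bit characters; to_ucs2_le_with_bom is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: builds the byte sequence for a little-endian UCS-2
// file, starting with the FF FE byte-order mark. Each character is written
// low byte first, then the prefix (high) byte. Assumes characters fit in 16 bits.
std::string to_ucs2_le_with_bom(const std::wstring& text)
{
    std::string bytes;
    bytes += '\xFF'; // byte-order mark, little-endian form
    bytes += '\xFE';
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        const unsigned long code = static_cast<unsigned long>(text[i]) & 0xFFFFu;
        bytes += static_cast<char>(code & 0xFF); // low byte, e.g. 0x28
        bytes += static_cast<char>(code >> 8);   // prefix byte, e.g. 0x06
    }
    return bytes;
}
```

The result can be passed to fwrite(bytes.data(), 1, bytes.size(), stream), with the stream opened in "wb" mode so that no byte gets translated on the way out.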

John Harrison · Aug 25, 2003

wael said:
Sorry, the byte order was my mistake.
I am also sorry if I cannot make you understand; maybe it is my bad language.
I am trying to encode a wchar_t according to 'The Unicode Standard and ISO/IEC 10646'.
This is the format which I need.
Please look at:
Figure 1-1. Wide ASCII
http://www.unicode.org/book/uc20ch1.html
The picture shows how characters look in binary format.
You will find 'A' is '0000 0000 0100 0001' = (in hex) '0 0 4 1'
The other character below it (an Arabic character) is '0000 0110 0011 0011' = (in hex) '0 6 3 3'
The problem is that if I get the binary string of this Arabic character,
it is '0000 0000 1101 0011', which != '0000 0110 0011 0011'.
So I must convert '0000 0000 1101 0011' to ISO/IEC 10646.

Sorry to disturb you, John,
and thank you for trying to help me.

Thanks

Wael, at last I think I understand you!

You have characters encoded in one character set and you want to convert
them to Unicode. Do you know which character set you have currently? I think
there are two commonly used character sets for Arabic: one is ISO-8859-6,
the other is CP1256. Do you know which you have?

Whatever you have, it should just be a simple matter of setting up a table
to convert between them.

Here is a table that converts CP1256 to Unicode

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT

and here's a table that converts ISO-8859-6 to Unicode

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-6.TXT

Hope this helps.
john
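
As a rough illustration of the table idea, here is a sketch for the ISO-8859-6 case only. It assumes the incoming bytes really are ISO-8859-6 (for CP1256 a different table, built from CP1256.TXT, would be needed). The ranges follow the 8859-6.TXT mapping file linked above, and both function names are hypothetical.

```cpp
#include <string>

// Hypothetical helper: maps one ISO-8859-6 byte to its Unicode code point.
// Returns 0xFFFD (the replacement character) for bytes that ISO-8859-6
// leaves undefined.
unsigned iso8859_6_to_unicode(unsigned char b)
{
    if (b <= 0xA0 || b == 0xA4 || b == 0xAD) return b;        // same value in Unicode
    if (b == 0xAC) return 0x060C;                             // Arabic comma
    if (b == 0xBB) return 0x061B;                             // Arabic semicolon
    if (b == 0xBF) return 0x061F;                             // Arabic question mark
    if (b >= 0xC1 && b <= 0xDA) return 0x0621u + (b - 0xC1);  // U+0621 .. U+063A
    if (b >= 0xE0 && b <= 0xF2) return 0x0640u + (b - 0xE0);  // U+0640 .. U+0652
    return 0xFFFD;                                            // undefined byte
}

// Converts a whole byte string, one wide character per input byte.
std::wstring iso8859_6_to_wide(const std::string& bytes)
{
    std::wstring result;
    result.reserve(bytes.size());
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        result.push_back(static_cast<wchar_t>(iso8859_6_to_unicode(b)));
    }
    return result;
}
```

With this mapping the byte 0xD3 ('1101 0011' in the example above) becomes 0x0633, which is the '0 6 3 3' value from the Unicode chart.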

You have characters which are encoded using ISO-8859-6 (Arabic)

John Ericson · Aug 25, 2003

You might want to check out
http://www.dinkumware.com/libDCorX.html CoreX library
character set converters, google on Plauger and/or Pete
Becker IIRC for more info in this newsgroup (apologies if
I'm misunderstanding what you're trying to do).

- -
Best Regards, John E.

wael · Aug 26, 2003

Thank you, John E.,
for trying to help.
http://www.dinkumware.com/libDCorX.html CoreX library
Is this library free? And does it work with VC++ 6?

Thanks
