unicode (UCS-2 encoded)

Discussion in 'C++' started by wael, Aug 22, 2003.

  1. wael

    wael Guest

    Hello all,
    I want to convert a wchar_t to its UCS-2 encoded form; for example, 0041
    is a char encoded in UCS-2.
    Please look at these:
    http://www.unicode.org/charts/
    http://www.unicode.org/

    Every language has a chart.
    For example:
    char 'A' = 0041 --> (UCS-2 encoded)

    A char from another language = 0628 (this one is in a different language).
    I hope you understand what I mean.



    --
    Thank you for taking the time to read this.
    best regards
    wael ahmed
     
    wael, Aug 22, 2003
    #1

  2. wael

    Jason Guest

    Do you mean that you want to convert strings in a locale-specific encoding
    (ASCII, UTF-8, Big5, etc.) into Unicode UCS-2 two-byte entities, and then
    store them in a wchar_t?

    When porting a web browser, we had a type tChar16 to put Unicode into; it's
    worthwhile using typedefs anyway, even though C++ has wchar_t. We wrote our
    own language conversion libraries, but it depends on what you want to do.
    It's probably not so much of a C++ question. Look up internationalization
    on the web, in relation to C++.
     
    Jason, Aug 22, 2003
    #2

  3. wael

    wael Guest

    "Jason" <@> wrote in message news:<3f468dcd@shknews01>...
    > Do you mean that you want to convert locale specific strings like ASCII,
    > utf8, big5, etc into unicode UCS2 two byte entities, and then store them in
    > a wchar_t?
    >
    > When porting a web browser, we had a type tChar16 to put unicode into; it's
    > worthwhile using typedefs anyway, even though C++ has wchar_t. We wrote our
    > own language conversion libraries, it depends what you want to do. It's
    > probably not so much of a C++ question. Look up internationalization on the
    > web, in relation to C++



    I have a wchar_t character.
    Suppose it is 16 bits wide, like this:

    wchar_t ch = L'A';
    // char 'A' encoded in UCS-2 is 0041
    How do I get ch as 2 UCS-2 encoded bytes?


    Some data is stored in the database as UCS-2:
    0041 0065 etc ... 0628 0627 ....
    Thank you for taking the time to read this.
     
    wael, Aug 24, 2003
    #3
  4. "wael" <> wrote in message
    news:...
    > <snip>

    Like this?

    wchar_t ch = L'A';
    unsigned char bytes[2];
    bytes[0] = ch / 256; // high byte: bytes[0] == 0x00
    bytes[1] = ch % 256; // low byte:  bytes[1] == 0x41

    Still not completely clear what you are trying to do.

    john
     
    John Harrison, Aug 24, 2003
    #4
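
The split into two bytes shown in post #4 can also be written with shifts and masks, together with the reverse step. This is only a minimal sketch: the names to_ucs2_be and from_ucs2_be are made up for illustration, the high byte is assumed to come first (big-endian, matching the 0041 notation used in the thread), and the value is assumed to fit in 16 bits, since wchar_t is wider than that on some platforms.

```cpp
#include <array>
#include <cassert>

// Split one character into two bytes, high byte first (big-endian).
// Hypothetical helper; assumes the value fits in 16 bits (UCS-2 range).
std::array<unsigned char, 2> to_ucs2_be(wchar_t ch)
{
    const unsigned long value = static_cast<unsigned long>(ch);
    assert(value <= 0xFFFF); // UCS-2 covers only U+0000..U+FFFF

    std::array<unsigned char, 2> bytes;
    bytes[0] = static_cast<unsigned char>(value >> 8);   // high byte ("prefix")
    bytes[1] = static_cast<unsigned char>(value & 0xFF); // low byte
    return bytes;
}

// Rebuild the character from its two big-endian bytes.
wchar_t from_ucs2_be(unsigned char high, unsigned char low)
{
    return static_cast<wchar_t>((high << 8) | low);
}
```

With these helpers, L'A' gives the bytes 00 41, and the bytes 06 27 give back the character 0x0627.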
  5. wael

    Jason Guest

    I am afraid I don't understand the question either. Perhaps you can tell us
    what data you're reading, whether it's in Unicode or ASCII, and what you
    want to do with it, instead of talking about wchar_t's and UCS-2 straight
    away.
     
    Jason, Aug 24, 2003
    #5
  6. "wael" <> wrote in message
    news:...
    > Thank you for the help.
    > Let me be clearer:
    > 1- I receive text from an ASCII socket like this (I will assume it is in
    > this form):
    > {0041 , 0628 , 0627 , 0042}
    > Each group of 4 chars is a UCS-2 code in hex, as defined in
    > http://www.unicode.org/charts/
    > 00 is the prefix for English; 41 is the char as defined in the chart
    > http://www.unicode.org/charts/PDF/U0000.pdf
    > 06 is the prefix for Arabic; 28 is the char as defined in the chart
    > http://www.unicode.org/charts/PDF/U0600.pdf
    > I also receive data for other languages, like Greek.
    >
    > I want to convert the incoming string to wchar_t, and from wchar_t to
    > send it by socket.
    >


    OK, try again, perhaps like this?

    ascii_data is your string of Unicode numbers separated by commas, with a
    leading { and trailing }, i.e. what you read from the socket. At the end,
    unicode_data is a wide string of Unicode characters; you can write that to
    your other socket.

    #include <algorithm>
    #include <istream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::string ascii_data = "{0041 , 0628 , 0627 , 0042}";
        // remove leading and trailing {}
        ascii_data.erase(ascii_data.begin());
        ascii_data.erase(ascii_data.end() - 1);
        // replace commas with spaces
        std::replace(ascii_data.begin(), ascii_data.end(), ',', ' ');
        // use string as stream
        std::istringstream str(ascii_data);
        // read hex numbers from stream, one Unicode value per number
        std::wstring unicode_data;
        unsigned char_value;
        while (str >> std::hex >> char_value)
        {
            unicode_data.push_back(static_cast<wchar_t>(char_value));
        }
    }

    john
     
    John Harrison, Aug 24, 2003
    #6
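
The question also asks for the opposite direction, from wchar_t back to the text form sent over the socket. A possible sketch follows; the function name to_hex_list is invented here, and the output format (four uppercase hex digits per character, separated by " , " inside braces) is an assumption copied from the sample string in the thread rather than a documented protocol.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Format a wide string as "{0041 , 0628 , ...}".
// Hypothetical helper; the exact wire format is assumed from the sample data.
std::string to_hex_list(const std::wstring& text)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    out << '{';
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
    {
        const unsigned long value = static_cast<unsigned long>(text[i]);
        assert(value <= 0xFFFF); // must fit in four hex digits
        if (i != 0)
            out << " , ";
        // setw affects only the next insertion, so it is repeated per value
        out << std::setw(4) << value;
    }
    out << '}';
    return out.str();
}
```

The result is a plain ASCII std::string, so it can be handed to whatever socket send routine is already in use.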
  7. wael

    wael Guest

    "John Harrison" <> wrote in message news:<biagg9$6f995$-berlin.de>...
    > <snip>

    Thank you for your help.
    The code you sent me is cool, but sorry, it works for English only.
    Look at this code:
    // this is what Notepad does
    BYTE aa[2];
    // header for the text file; it tells notepad.exe that the file type is Unicode
    aa[0] = 0xFE;
    aa[1] = 0xFF;
    // end header
    // my problem is here: how to encode 06 28 into a wchar_t,
    // or how Notepad reads this file
    // these codes are valid, and if you have the Arabic language installed in
    // Win2000 or NT you will see it fine
    // the hex value is not like English:
    // hex('A') = 41, but it is not the same for other languages
    // i.e. hex '\x28' is not what I get from hex(ASCII code of the char)
    // and this is the real problem
    BYTE bb[4]; // 4 bytes = 2 Unicode characters of 2 bytes each
    bb[0] = '\x06'; // Unicode hex encoding prefix
    bb[1] = '\x28'; // Unicode hex encoding
    bb[2] = '\x06'; // Unicode hex encoding prefix
    bb[3] = '\x27'; // Unicode hex encoding

    FILE *stream;
    stream = fopen( "c:\\fprintf.txt", "wb" ); // binary mode: no newline translation
    fwrite( aa, sizeof(BYTE), 2, stream ); // write header
    fwrite( bb, sizeof(BYTE), 4, stream ); // write data
    fclose( stream ); // close stream


    thank you for take time read this
    wael ahmed
     
    wael, Aug 25, 2003
    #7
  8. "wael" <> wrote in message
    news:...
    > <snip>

    Check your byte order! Windows normally stores Unicode little-endian, so
    the prefix is the second byte. The header has to match: FF FE announces
    little-endian data, while FE FF announces big-endian data.

    aa[0] = 0xFF;
    aa[1] = 0xFE;
    fwrite( aa, sizeof(BYTE), 2, stream ); //write header (little-endian)
    bb[0] = '\x28'; //unicode hex encoding
    bb[1] = '\x06'; //unicode hex encoding prefix
    bb[2] = '\x27'; //unicode hex encoding
    bb[3] = '\x06'; //unicode hex encoding prefix
    fwrite( bb, sizeof(BYTE), 4, stream ); //write data

    Apart from that I don't know what to suggest, I still don't understand what
    you are trying to do.

    john
     
    John Harrison, Aug 25, 2003
    #8
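
The byte-order question in posts #7 and #8 can be made explicit by writing the byte-order mark (U+FEFF) in the same order as the data. The sketch below is illustrative only: ByteOrder and encode_ucs2_with_bom are hypothetical names, and characters outside the 16-bit range are not handled.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class ByteOrder { LittleEndian, BigEndian };

// Serialize 16-bit characters in the requested byte order, preceded by a
// byte-order mark (U+FEFF) written in that same order.
// Hypothetical helper; no surrogate pairs, so BMP characters only.
std::vector<unsigned char> encode_ucs2_with_bom(const std::wstring& text,
                                                ByteOrder order)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(2 * (text.size() + 1));

    auto put_unit = [&bytes, order](unsigned long unit) {
        const unsigned char high = static_cast<unsigned char>(unit >> 8);
        const unsigned char low = static_cast<unsigned char>(unit & 0xFF);
        if (order == ByteOrder::BigEndian) {
            bytes.push_back(high);
            bytes.push_back(low);
        } else {
            bytes.push_back(low);
            bytes.push_back(high);
        }
    };

    put_unit(0xFEFF); // BOM: FE FF when big-endian, FF FE when little-endian
    for (wchar_t ch : text) {
        const unsigned long value = static_cast<unsigned long>(ch);
        assert(value <= 0xFFFF); // outside UCS-2; not handled in this sketch
        put_unit(value);
    }
    return bytes;
}
```

For the two Arabic characters 0628 0627 this yields FF FE 28 06 27 06 in little-endian order and FE FF 06 28 06 27 in big-endian order. The resulting bytes should be written to a file opened in binary mode.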
  9. "wael" <> wrote in message
    news:...
    > Sorry,
    > the byte order was a mistake.
    > I am sorry if I cannot make you understand; maybe it is my bad English.
    > I am trying to encode a wchar_t according to 'The Unicode Standard and
    > ISO/IEC 10646'.
    > This is the format I need.
    > Please look at:
    > Figure 1-1. Wide ASCII
    > http://www.unicode.org/book/uc20ch1.html
    > The picture shows how chars look in binary format.
    > You will find 'A' = '0000 0000 0100 0001' = (in hex) '0 0 4 1'
    > The other char below it (an Arabic char) = '0000 0110 0011 0011' = in hex '0 6 3 3'
    > The problem is that if I get the binary string of this Arabic char,
    > it is '0000 0000 1101 0011', which != '0000 0110 0011 0011'.
    > So I must convert '0000 0000 1101 0011' to ISO/IEC 10646.
    >
    >
    >
    > Sorry for disturbing you, John,
    > and thank you for trying to help me.
    >
    > Thanks


    Wael, at last I think I understand you!

    You have characters encoded in one character set and you want to convert
    them to Unicode. Do you know which character set you have currently? I think
    there are two commonly used character sets for Arabic: one is ISO-8859-6,
    the other is CP1256. Do you know which you have?

    Whatever you have, it should just be a simple matter of setting up a table
    to convert between them.

    Here is a table that converts CP1256 to Unicode

    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT

    and here's a table that converts ISO-8859-6 to Unicode

    http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-6.TXT

    Hope this helps.
    john



    You have characters which are encoded using ISO-8859-6 (Arabic)
     
    John Harrison, Aug 25, 2003
    #9
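
The table conversion suggested in post #9 could look roughly like this for ISO-8859-6. The function names are invented for illustration, and the ranges are written out by hand, so they should be checked against the 8859-6.TXT mapping file linked in that post before being relied on. Unassigned bytes are mapped to U+FFFD here, which is a choice made for this sketch and not something taken from the thread.

```cpp
#include <cassert>
#include <string>

// Map one ISO-8859-6 byte to its Unicode value.
// Hypothetical helper; ranges assumed from the published mapping table.
wchar_t iso8859_6_to_unicode(unsigned char byte)
{
    if (byte <= 0xA0 || byte == 0xA4 || byte == 0xAD)
        return static_cast<wchar_t>(byte);                   // same value in Unicode
    if (byte == 0xAC) return static_cast<wchar_t>(0x060C);   // Arabic comma
    if (byte == 0xBB) return static_cast<wchar_t>(0x061B);   // Arabic semicolon
    if (byte == 0xBF) return static_cast<wchar_t>(0x061F);   // Arabic question mark
    if (byte >= 0xC1 && byte <= 0xDA)
        return static_cast<wchar_t>(0x0621 + (byte - 0xC1)); // letters U+0621..U+063A
    if (byte >= 0xE0 && byte <= 0xF2)
        return static_cast<wchar_t>(0x0640 + (byte - 0xE0)); // U+0640..U+0652
    return static_cast<wchar_t>(0xFFFD);                     // unassigned byte
}

// Convert a whole byte string, one wide character per input byte.
std::wstring decode_iso8859_6(const std::string& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
        out.push_back(iso8859_6_to_unicode(static_cast<unsigned char>(bytes[i])));
    assert(out.size() == bytes.size());
    return out;
}
```

The byte '1101 0011' (0xD3) from the quoted message comes out as 0x0633, which is the '0 6 3 3' value wanted there. CP1256 would need its own table built from the CP1256.TXT file, since it assigns some bytes differently.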
  10. wael

    John Ericson Guest

    "wael" <> wrote in message
    news:...
    <snip>

    You might want to check out
    http://www.dinkumware.com/libDCorX.html CoreX library
    character set converters, google on Plauger and/or Pete
    Becker IIRC for more info in this newsgroup (apologies if
    I'm misunderstanding what you're trying to do).

    - -
    Best Regards, John E.
     
    John Ericson, Aug 25, 2003
    #10
  11. wael

    wael Guest

    "John Ericson" <> wrote in message news:<W8p2b.5781$>...
    > <snip>

    Thank you, John E.,
    for trying to help.
    http://www.dinkumware.com/libDCorX.html CoreX library
    Is this library free? And does it work with VC++ 6?

    Thanks
     
    wael, Aug 26, 2003
    #11
