Problems with UTF-8 on Windows

Discussion in 'C++' started by amandeep.bhatia1@gmail.com, Jan 11, 2007.

  1. Guest

    Hello Friends,

    I am working on adding internationalization support to an
    existing project.

    While adding UTF-8 support, I ran into a problem during a proof of
    concept (POC).

    I have a C string, declared as:
    const char* utf8buf = "Bienvenue à l'anglais ";

    I want to support UTF-8 for I/O and wchar_t strings for internal
    manipulation, so I call setlocale(LC_CTYPE, "UTF8") before the main
    string-handling code runs.

    Then I use MultiByteToWideChar (with CP_UTF8 as the code page) to
    convert it to a wstring.

    Before output, I convert the string back to UTF-8 using
    WideCharToMultiByte.

    The problem is that when I print the UTF-8 string I get back from
    this round trip, the output is "Bienvenue l'anglais", which is not
    the same as the input utf8buf.

    Does the C++ string class support UTF-8?

    In the real environment, we plan to get the UTF-8 strings from a
    MySQL database.

    How can I correct this?

    Is there any other way in C/C++ to represent UTF8 strings?

    Thanks,
    Aman
     
    , Jan 11, 2007
    #1

  2. peter koch Guest

    wrote:
    > Hello Friends,
    >
    > I am working on adding internationalization support to an
    > existing project.
    >
    > While adding UTF-8 support, I ran into a problem during a proof of
    > concept (POC).
    >
    > I have a C string, declared as:
    > const char* utf8buf = "Bienvenue à l'anglais ";


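    The above is not valid UTF-8 unless the source file itself is saved
    as UTF-8 and the compiler passes those bytes through unchanged.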
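
To make that concrete: in UTF-8 the character à is the two bytes 0xC3 0xA0, whereas a source file saved in Windows-1252 or Latin-1 stores it as the single byte 0xE0. The following is only a minimal sketch; is_valid_utf8, latin1buf and self_check are illustrative names that do not come from the original post. It spells the bytes out explicitly and checks both variants:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8
// (no stray or missing continuation bytes, no overlong forms,
// no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    const unsigned long min_cp[] = {0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // code point being assembled
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return false;       // continuation byte or invalid lead byte

        if (s.size() - i <= extra) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[extra]) return false;                 // overlong form
        if (cp > 0x10FFFF) return false;                      // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       // surrogate
        i += extra + 1;
    }
    return true;
}

// The text from the post with à written as its two UTF-8 bytes, so the
// result does not depend on how the source file is saved.
const char* const utf8buf = "Bienvenue \xC3\xA0 l'anglais ";

// Assumption: what a Windows-1252 / Latin-1 source file would store instead.
const char* const latin1buf = "Bienvenue \xE0 l'anglais ";

void self_check() {
    assert(is_valid_utf8(utf8buf));
    assert(!is_valid_utf8(latin1buf));
}
```

If the second buffer is what the compiler actually emitted, a conversion that treats the input as UTF-8 has no valid character to convert at that position, which would be consistent with the à going missing.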

    >
    > I want to support UTF-8 for I/O and wchar_t strings for internal
    > manipulation, so I call setlocale(LC_CTYPE, "UTF8") before the main
    > string-handling code runs.


    Now we enter implementation-defined territory: which locale names
    setlocale accepts depends on the platform.

    >
    > Then I use MultiByteToWideChar (with CP_UTF8 as the code page) to
    > convert it to a wstring.


    And this is not C++ but Windows and thus off-topic.
    >
    > Then again before output I am converting the string back to UTF8 format
    > using WideCharToMultiByte.


    Once again off-topic.
    >
    > The problem is that when I print the UTF-8 string I get back from
    > this round trip, the output is "Bienvenue l'anglais", which is not
    > the same as the input utf8buf.
    >
    > Does the C++ string class support UTF-8?

    Well... the short answer is no. You will have no problem storing a
    UTF-8 buffer in a std::string, but access to individual characters is
    off: string[n] might be a whole character, but it could also be one
    byte of a multi-byte sequence.
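
A small sketch of what that means in practice. The helper names count_code_points, is_continuation_byte and demo_indexing are made up for illustration, and the counting function assumes the string already holds well-formed UTF-8:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if `c` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts code points by counting every byte that starts a sequence.
// Assumes `s` is well-formed UTF-8; no validation is done here.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation_byte(c)) ++n;
    }
    return n;
}

void demo_indexing() {
    const std::string s = "\xC3\xA0 la";   // "à la" as explicit UTF-8 bytes
    assert(s.size() == 5);                 // size() counts bytes
    assert(count_code_points(s) == 4);     // four characters: à, space, l, a
    assert(is_continuation_byte(s[1]));    // s[1] is the second byte of à
}
```

So std::string stores the bytes faithfully, but size(), operator[] and substr() all work in bytes, and slicing at an arbitrary index can cut a character in half.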
    >
    > In the real environment, we plan to get the UTF-8 strings from a
    > MySQL database.


    There is no problem getting utf-8 from a MySQL database, but I doubt
    that there is any reason to store it in a std::string (but it will not
    lead to an incorrect program).
    >
    > How can I correct this?

    Correct what? The problem with the missing à above could very well be
    related to the fact that the string above is not valid UTF-8, but you
    should go to the platform-specific group (perhaps something like
    microsoft.public.internationalization?) for that part.
    >
    > Is there any other way in C/C++ to represent UTF8 strings?

    You can store it in a variety of ways. The most natural way for many
    applications would be to convert at the API boundaries, for instance
    at the point where you get the data from your database. If you expect
    to keep a large number of strings in memory and you expect UTF-8 to
    be a smart internal format, you should look for a UTF-8 string class.
    Most probably there are already some nice classes out there, and I
    vaguely remember having read something about UTF-8 strings in boost
    (and that is always the first place I look).
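
As a rough illustration of converting at the boundary, here is a portable sketch that decodes UTF-8 into UTF-32 for internal use and encodes it back for output. The names utf8_to_utf32, utf32_to_utf8 and demo_round_trip are hypothetical, std::u32string is used in place of wchar_t strings so that the element size is the same on every platform, and the decoder is deliberately simple: malformed or truncated sequences become U+FFFD, and overlong forms are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points (simplified, see lead-in).
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;
        char32_t cp = 0xFFFD;    // stays U+FFFD for an invalid lead byte
        if (c < 0x80)                { cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        ++i;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;     // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(cp);
    }
    return out;
}

// Encodes UTF-32 code points as UTF-8 bytes.
// Assumes every code point is a valid Unicode scalar value.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

void demo_round_trip() {
    const std::string utf8 = "Bienvenue \xC3\xA0 l'anglais";
    const std::u32string wide = utf8_to_utf32(utf8);
    assert(wide.size() == utf8.size() - 1);   // à is 2 bytes, 1 code point
    assert(wide[10] == 0xE0);                 // U+00E0 is à
    assert(utf32_to_utf8(wide) == utf8);      // round trip is lossless
}
```

With this split, the rest of the program can index and compare code points directly, and only the code that talks to the database or the output stream needs to know about UTF-8.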

    /Peter
     
    peter koch, Jan 11, 2007
    #2
