How should I handle the multibyte char set string in C++?

Discussion in 'C++' started by Dancefire, Apr 29, 2007.

  1. Dancefire

    Dancefire Guest

    Hi, everyone,

    I'm writing a program that uses wstring (wchar_t) as its internal
    string type.

    The problem arises when I convert multibyte strings in various
    encodings to wstring (which is Unicode: UCS-2LE (BMP) on Win32, and
    UCS-4 on Linux?).

    I have 2 ways to do the job:

    1) Use std::locale: set std::locale::global() and use mbstowcs() and
    wcstombs() to do the conversion.

    2) Use platform-dependent functions, such as libiconv on Linux, or
    MultiByteToWideChar() and WideCharToMultiByte() on Win32.

    At first glance, solution 1) looks like the obvious choice, since it is
    the C++ way. As far as I understand, the codecvt facet is really just a
    wrapper that calls libiconv on Linux, or MultiByteToWideChar() and
    WideCharToMultiByte() on Win32 (depending on the STL implementation),
    to do the real work.

    However, I have 2 problems.

    First, I have to set the global locale before I do the conversion.

    This has 2 side effects. The first is that in a multithreaded program,
    changing the global setting affects other threads that use a different
    encoding for their conversions. Yes, I can lock around the conversion,
    but that makes little sense and causes really low performance.

    The second is that calling std::locale::global() every time is
    time-consuming: creating a locale object and making it the global
    locale is not a light operation, so it hurts performance.

    The second problem is that the system-dependent conversion functions
    seem to support many more encodings than std::locale does in each STL
    implementation. For example, libiconv supports the UCS-2LE encoding,
    but g++'s locale doesn't. MultiByteToWideChar() supports UTF-8
    conversion, but std::locale in the MSVC 8.0 STL doesn't support
    ".65001" (code page 65001, which is UTF-8).

    A third problem might be that locale names differ between platforms,
    but I can easily work around that with #ifdef/#endif.

    So, back to the original question: how should I handle MBCS strings in
    C++?

    Thanks.
     
    Dancefire, Apr 29, 2007
    #1

  2. Dancefire

    James Kanze Guest

    On Apr 29, 4:40 pm, Dancefire <> wrote:

    > I'm writing a program using wstring(wchar_t) as internal string.


    > The problem is raised when I convert the multibyte char set string
    > with different encoding to wstring(which is Unicode, UCS-2LE(BMP) in
    > Win32, and UCS4 in Linux?).


    > I have 2 ways to do the job:


    > 1) use std::locale, set std::locale::global() and use mbstowcs() and
    > wcstombs() do the conversion.


    Why not std::codecvt? A facet which you can obtain from a
    locale.

    > 2) use platform dependent functions to do the job, such as libiconv in
    > Linux, or MultiByteToWideChar() and WideCharToMultiByte() in Win32.


    > At first glance, it might be definitely to choose the solution 1) to
    > do the job. Since it's really C++ favor, and in details, the codecvt
    > facet is actually wrap the function by calling libiconv in Linux, and
    > MultiByteToWideChar() or WideCharToMultiByte() in Win32 (by different
    > STL implementation) to do the real job.(if my understanding is
    > correct).


    > However, I have 2 problems.


    > First, I have to set the global locale before I do the conversion.


    Why? You can get a facet from any locale. That's the one
    advantage C++ locales have over the C stuff.
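
    As a rough illustration of that point, the sketch below converts a
    narrow string through the codecvt facet of a locale object passed in by
    the caller, without ever calling std::locale::global(). The helper name
    widen_with is hypothetical, and the sketch assumes that no wide
    character is produced from less than one byte of input.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert a narrow string to a wide string using the codecvt facet of the
// given locale. The global locale is never touched.
// (widen_with is a hypothetical helper name, not a library function.)
std::wstring widen_with(const std::locale& loc, const std::string& in)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(loc);

    if (in.empty())
        return std::wstring();

    std::mbstate_t state = std::mbstate_t();  // initial shift state
    std::wstring out(in.size(), L'\0');       // assumes <= 1 wchar_t per byte

    const char* from_next = 0;
    wchar_t* to_next = 0;
    const std::codecvt_base::result res =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

    // Treat anything but a complete conversion as a failure in this sketch.
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or input incomplete");

    out.resize(to_next - &out[0]);
    return out;
}
```

    A call such as widen_with(std::locale::classic(), "abc") uses the "C"
    locale. A named locale can be passed the same way, but the available
    names are platform-specific, and the std::locale constructor throws
    std::runtime_error when a name is not recognized.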

    [...]
    > Second problem, looks like the system dependent conversion functions
    > support much more encoding than std::locale() by each STL
    > implementation.


    That's a problem with the C++ library implementation. A quality
    implementation will support all of the code sets that are
    installed on the system.

    > For example, libiconv support UCS-2LE encoding, but g++'s
    > locale() doesn't support it. MultiByteToWideChar() support
    > UTF8 conversion, but MSVC(8.0)'s STL std::locale() doesn't
    > support ".65001" for code page 65001 which is UTF8.


    Finding what locales are available and work can be a bit of a
    game:). And how they are named, if you're not under Unix.

    > The locale string is not same on different platform might be the third
    > problem, but I can easily ignore it by #ifdef #endif.


    > So, back to beginning question, how should I handle the MBCS string in
    > C++?


    The official answer is std::codecvt. In practice, I roll my
    own:).

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Apr 29, 2007
    #2

  3. Dancefire

    Dancefire Guest

    > Why not std::codecvt? A facet which you can obtain from a
    > locale.


    Oops, I missed std::codecvt. Thank you.

    After trying std::codecvt, I have 2 more questions.

    1) Should we initialize the mbstate_t variable? And how can it be
    initialized portably, in a C++ way?

    Much of the sample code I saw on the net doesn't initialize the
    mbstate_t variable. For example:

    http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt.html#sec12

    std::mbstate_t state;

    And the sample in MSDN that ships with Visual Studio 2005:

    mbstate_t state;

    They just declare it and use it, without ever assigning an initial value
    to the state. And I did run into a problem in VC80 when I didn't
    initialize the state to zero (in debug mode the first character always
    gets messed up; the ones that follow are OK).

    But the online version of MSDN does initialize the mbstate_t variable:
    http://msdn2.microsoft.com/en-us/library/xse90h58(VS.80).aspx

    mbstate_t state = {0};

    I also found code that uses memset() to zero the whole object, but I
    don't think that's the C++ way.
    How should I make the initialization portable?

    2) I can find out the wchar_t* buffer length for codecvt.in() with
    codecvt.length(), but how do I find out the char* buffer length for
    codecvt.out()?

    I can pass a null pointer to mbstowcs() or wcstombs() to get the length
    of the output buffer I need, but I don't know how to do the same thing
    with codecvt<>.

    > > For example, libiconv support UCS-2LE encoding, but g++'s
    > > locale() doesn't support it. MultiByteToWideChar() support
    > > UTF8 conversion, but MSVC(8.0)'s STL std::locale() doesn't
    > > support ".65001" for code page 65001 which is UTF8.

    >
    > Finding what locales are available and work can be a bit of a
    > game:). And how they are named, if you're not under Unix.
    >


    I use "locale -a" to list all the locale names supported on Linux, and
    the following link to find the locale names on Windows:

    http://msdn2.microsoft.com/en-us/library/hzz3tw78(vs.80).aspx

    However, I still cannot handle "UCS-2"/"UTF16" on Linux or
    "UTF8"/"UTF16" on Windows through std::locale. Do you know how I can
    do this?

    >
    > The official answer is std::codecvt. In practice, I roll my
    > own:).
    >



    Thanks again, you have been a real help.
     
    Dancefire, Apr 30, 2007
    #3
  4. Dancefire

    Guest

    On Apr 30, 4:56 am, Dancefire <> wrote:
    [...]
    > 1) Should we initialize mbstate_t variable? And how to initialize the
    > mbstate_t portable and in C++ way?
    >
    > Many sample code I saw on the net, didn't initialize the mbstate_t
    > variable. Such as:
    >
    > http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt.html#sec12
    >
    > std::mbstate_t state;


    Strictly speaking you should zero-initialize the state. It doesn't
    matter in the trivial example shown in the Apache stdcxx documentation
    but in general the state must be either zeroed out (i.e., to represent
    the initial shift state) or be the result of a prior conversion.

    I have corrected the example program to initialize the state variable,
    see: http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
    the docs next.

    >

    [...]
    > mbstate_t state = {0};
    >
    > And I do find a code using memset() to set all range to zero, but I
    > don't think it's c++'s way.
    > How should I make the initial portable?


    Like so:

    mbstate_t state = mbstate_t ();
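
    A small check along these lines shows the idiom in context. The function
    name fresh_state_is_initial is made up for illustration; std::mbsinit()
    is the standard way to ask whether a state object describes the initial
    conversion state.

```cpp
#include <cwchar>

// Value-initialization zeroes the object, and a zero-valued mbstate_t
// describes the initial conversion (shift) state.
// (fresh_state_is_initial is a hypothetical name used for illustration.)
bool fresh_state_is_initial()
{
    std::mbstate_t state = std::mbstate_t();

    // std::mbsinit() returns nonzero for the initial conversion state.
    return std::mbsinit(&state) != 0;
}
```

    With a C++11 compiler, std::mbstate_t state{}; should have the same
    effect.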

    >
    > 2) I can know the wchar_t* buf length for codecvt.in() by
    > codecvt.length(), but how should I know the char * buffer length for
    > codecvt.out()?


    codecvt::length() returns the number of extern_type characters (i.e.,
    narrow chars for codecvt<wchar_t, char>).
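
    To make the direction of the count concrete, here is a minimal sketch.
    The wrapper name bytes_for is hypothetical; it simply forwards to
    codecvt::length(), which reports how many narrow chars of the input
    would be consumed to produce at most max_wide wide characters.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Number of chars of `in` that would be consumed to produce at most
// `max_wide` wide characters, according to the codecvt facet of `loc`.
// (bytes_for is a hypothetical wrapper, not a library function.)
int bytes_for(const std::locale& loc, const std::string& in,
              std::size_t max_wide)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(loc);

    std::mbstate_t state = std::mbstate_t();  // start in the initial state
    return cvt.length(state, in.data(), in.data() + in.size(), max_wide);
}
```

    Note that the result counts input chars, not output wide characters, so
    by itself it does not give the size of the destination buffer.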

    >

    [...]
    > However, I still cannot handle "UCS-2"/"UTF16" in Linux or
    > "UTF8"/"UTF16" in Windows by std::locale. Do you know how can I do
    > this?


    In the Apache C++ Standard Library you can do it using
    a codecvt_byname facet constructed with the name "UTF-8@UCS"
    as an argument, although it's not mentioned on the documentation page:
    http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt-byname.html
    Let me look into adding it.
     
    , Apr 30, 2007
    #4
  5. Dancefire

    Dancefire Guest

    > > 1) Should we initialize mbstate_t variable? And how to initialize the
    > > mbstate_t portable and in C++ way?

    >
    > > Many sample code I saw on the net, didn't initialize the mbstate_t
    > > variable. Such as:

    >
    > >http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt.html#sec12

    >
    > > std::mbstate_t state;

    >
    > Strictly speaking you should zero-initialize the state. It doesn't
    > matter
    > in the trivial example shown in the Apache stdcxx documentation but
    > in general the state must be either zeroed out (i.e., to represent the
    > initial shift state) or be the result of a prior conversion.
    >
    > I have corrected the example program to initialize the state variable,
    > see:http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
    > the docs next.
    >
    >


    Yes, the example in the Apache stdcxx documentation works, since it
    doesn't try to handle MBCS in a CJK encoding. If the state is not zero,
    the code has trouble with an MBCS string: the first 1-2 bytes are
    parsed incorrectly if they are greater than 0x80, while the following
    bytes may be parsed correctly; and if the first 1-2 chars are < 0x80,
    the call may simply return with an error.

    Thank you very much for correcting the code and the docs. It will make
    things much clearer for others and help them avoid the problem I faced.

    > [...]
    > > mbstate_t state = {0};

    >
    > > And I do find a code using memset() to set all range to zero, but I
    > > don't think it's c++'s way.
    > > How should I make the initial portable?

    >
    > Like so:
    >
    > mbstate_t state = mbstate_t ();
    >


    I get it, thank you very much.

    >
    >
    > > 2) I can know the wchar_t* buf length for codecvt.in() by
    > > codecvt.length(), but how should I know the char * buffer length for
    > > codecvt.out()?

    >
    > codecvt::length() returns the number of extern_type characters (i.e.,
    > narrow chars for codecvt<wchar_t, char>).
    >


    I'm a little confused here, even after reading the documentation. Could
    you give me a piece of code that does the same thing as the code
    below?

    ===================================
    string str("\xba\xba\xd6\xd7");
    size_t len = mbstowcs(0, str, str.length());
    wchar_t* wstr = new wchar_t[len+1];
    mbstowcs(wstr, str, len);
    ===================================
    And the reverse version:

    ===================================
    wstring wstr(L"\xbaba\xd6d7");
    size_t len = wcstombs(0, wstr, wstr.length());
    char* str = new char[len+1];
    wcstombs(str, wstr, len);
    ===================================

    The point is that I need to know the length of the output buffer, so
    that I can allocate it safely. How can I get the buffer length for
    both codecvt::in() and codecvt::out()?

    BTW, is the code above correct? I mean that the second call to
    wcstombs() or mbstowcs() uses "len" as the length, whereas the first
    call uses "wstr.length()" or "str.length()".

    >
    > [...]
    > > However, I still cannot handle "UCS-2"/"UTF16" in Linux or
    > > "UTF8"/"UTF16" in Windows by std::locale. Do you know how can I do
    > > this?

    >
    > In the Apache C++ Standard Library you can do it using
    > a codecvt_byname facet constructed with the name "UTF-8@UCS"
    > as an argument, although it's not mentioned on the documentation page:http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt-byname.html
    > Let me look into adding it.


    Thank you, now I know how to handle this in the Apache C++ Standard
    Library. I will try that.
    Do you know how I can do this with g++'s STL? I mean conversion
    between wchar_t*, which contains a UCS-4 string, and char*, which
    contains a UCS-2 or UTF-16 string.

    The problem came up while I was trying to make a project portable
    between Windows and Linux. I am trying to write Unicode strings to a
    file.

    When I choose UTF-8 for the file, I get 2 problems:

    1) VC80's STL doesn't support a UTF-8 locale (the Win32 API does, but
    using the Win32 API would make some of the code non-portable).
    2) All of the strings are CJK characters, so UTF-8 costs at least 3
    bytes per character, which is 50% more storage than UCS-2 would need.
    And I'm sure all the characters are in the BMP of ISO 10646, so I'd
    rather use just 16 bits per character in the file.

    However, if I choose UCS-2LE, which is exactly what wchar_t holds in VC,
    I have trouble reading the file on Linux: g++'s STL doesn't seem to
    support a UCS-2LE locale, and wchar_t on Linux is UCS-4 rather than
    UCS-2, so I cannot read the content directly. (It's the same kind of
    story: libiconv supports UCS-2LE, but using libiconv would make that
    part of the code non-portable and make my code depend on libiconv.)

    So, what should I do in this case?
     
    Dancefire, May 1, 2007
    #5
  6. Dancefire

    P.J. Plauger Guest

    "Dancefire" <> wrote in message
    news:...

    > .....
    > So, What should I do in this case?


    Everything you need is included in our Compleat Libraries, for both
    VC++ and gcc. But they cost $.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, May 1, 2007
    #6
  7. Dancefire

    Dancefire Guest

    On May 1, 7:46 pm, "P.J. Plauger" <> wrote:
    > Everything you need is included in our Compleat Libraries, for both
    > VC++ and gcc. But they cost $.
    >
    > P.J. Plauger
    > Dinkumware, Ltd.
    > http://www.dinkumware.com



    Yes, the Compleat Libraries are cool, but before I pay for them I need
    to make sure there is no easy way to do it myself.
    I'm developing an open source project, and for portability reasons I'd
    rather depend on the existing STL in VC80 Express on Windows, and on
    libstdc++ on Linux (or other platforms).
    I'm trying to find a common Unicode encoding supported by both the VC80
    Express STL and libstdc++.
     
    Dancefire, May 1, 2007
    #7
  8. Dancefire

    P.J. Plauger Guest

    "Dancefire" <> wrote in message
    news:...

    > Yes, the Compleat Libraries is cool. but before I pay it, I need to
    > make sure there is no way to do it easily.
    > I'm developing an open source project, for portability reason, I'd
    > better depends on existing STL in VC80 Express for windows, and
    > libstdc++ for Linux(or other).
    > I'm trying to find the common encoding for Unicode in both VC80
    > Express STL and libstdc++.


    Well, you can encode Unicode as:

    -- UTF-8 in an array of char

    -- UTF-16 in an array of short (or wchar_t under VC++)

    -- UCS-2 in an array of short (if you're willing to settle for the common
    65K Unicode subset)

    -- UTF-32 or UCS-4 in an array of long (or wchar_t under gcc)

    We supply a whole slew of interconversions between these forms, and
    the appropriate endian versions in files, in our Code Conversions
    library (part of the Compleat Libraries). See:

    file:///C:/htm_cplt/temp/index_cvt.html

    for an essay on code conversions and the list of facets we supply.
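
    For readers without access to such a library, the encoding forms listed
    above can be illustrated by hand. The following is only a sketch: the
    function names encode_utf8 and encode_utf16 are hypothetical, the input
    is assumed to be a valid code point (at most 0x10FFFF and not a
    surrogate), and no byte-order handling for files is shown.

```cpp
#include <string>
#include <vector>

// Encode one code point as UTF-8 in a sequence of char.
// (Hypothetical helper; assumes a valid, non-surrogate code point.)
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: beyond the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode the same code point as UTF-16 code units. For BMP characters the
// single unit is identical to the UCS-2 value.
std::vector<unsigned short> encode_utf16(unsigned long cp)
{
    std::vector<unsigned short> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<unsigned short>(cp));
    } else {
        cp -= 0x10000;                     // split into a surrogate pair
        out.push_back(static_cast<unsigned short>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<unsigned short>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

    For a CJK character such as U+4E2D this gives three UTF-8 bytes but a
    single 16-bit UTF-16 unit, which is the size difference raised earlier
    in the thread.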

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, May 1, 2007
    #8
  9. Dancefire

    Dancefire Guest

    >
    > Well, you can encode Unicode as:
    >
    > -- UTF-8 in an array of char
    >
    > -- UTF-16 in an array of short (or wchar_t under VC++)
    >
    > -- UCS-2 in an array of short (if you're willing to settle for the common
    > 65K Unicode subset)
    >
    > -- UTF-32 or UCS-4 in an array of long (or wchar_t under gcc)
    >
    > We supply a whole slew of interconversions between these forms, and
    > the appropriate endian versions in files, in our Code Conversions
    > library (part of the Compleat Libraries). See:
    >
    > file:///C:/htm_cplt/temp/index_cvt.html
    >
    > for an essay on code conversions and the list of facets we supply.
    >
    > P.J. Plauger
    > Dinkumware, Ltd.
    > http://www.dinkumware.com


    Thanks, but I can't open the link; it's local...

    And one more question about codecvt. I'm not familiar with it, so I
    need some help here.

    > > 2) I can know the wchar_t* buf length for codecvt.in() by
    > > codecvt.length(), but how should I know the char * buffer length for
    > > codecvt.out()?


    > codecvt::length() returns the number of extern_type characters (i.e.,
    > narrow chars for codecvt<wchar_t, char>).


    I'm a little confused here, even after reading the documentation. Could
    you give me a piece of code that does the same thing as the code
    below?

    ===================================
    string str("\xba\xba\xd6\xd7");
    size_t len = mbstowcs(0, str, str.length());
    wchar_t* wstr = new wchar_t[len+1];
    mbstowcs(wstr, str, len);
    ===================================
    And the reverse version:

    ===================================
    wstring wstr(L"\xbaba\xd6d7");
    size_t len = wcstombs(0, wstr, wstr.length());
    char* str = new char[len+1];
    wcstombs(str, wstr, len);
    ===================================

    The point is that I need to know the length of the output buffer, so
    that I can allocate it safely. How can I get the buffer length for
    both codecvt::in() and codecvt::out()?

    BTW, is the code above correct? I mean that the second call to
    wcstombs() or mbstowcs() uses "len" as the length, whereas the first
    call uses "wstr.length()" or "str.length()".

    Thanks
     
    Dancefire, May 2, 2007
    #9
  10. Dancefire

    Guest

    On May 1, 1:18 am, Dancefire <> wrote:
    [...]
    > I'm a little confuse here, even after read the document. Could you
    > give me a piece of code as example how to do same thing as below's
    > code:


    I don't blame you for being confused. You can't use length() for this
    (or for much else, I'm afraid). It's really not a very useful
    function.

    >
    > ===================================
    > string str("\xba\xba\xd6\xd7");
    > size_t len = mbstowcs(0, str, str.length());
    > wchar_t* wstr = new wchar_t[len+1];
    > mbstowcs(wstr, str, len);


    Here's an implementation of mbstowcs() using codecvt. I'll probably
    put it up on the Apache stdcxx site or include it in the documentation
    but I'm pasting it here for reference (let me know if you run into any
    problems with it). The reverse (i.e., wcstombs()) is analogous and
    I'll leave its implementation as an exercise for interested
    readers ;-)

    #include <cstddef>   // std::size_t
    #include <cstring>   // std::strlen
    #include <cwchar>    // std::mbstate_t
    #include <locale>    // std::locale, std::codecvt, std::use_facet

    std::size_t
    my_mbstowcs (std::mbstate_t *pstate,
                 wchar_t        *dst,
                 const char     *src,
                 std::size_t     size)
    {
        // a default-constructed locale is a copy of the global locale
        const std::locale global;

        typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;

        // retrieve the codecvt facet from the global locale
        const CodeCvt &cvt = std::use_facet<CodeCvt>(global);

        // use local shift state when pstate is null
        std::mbstate_t state = std::mbstate_t ();
        if (0 == pstate)
            pstate = &state;

        // use a small local buffer when dst is null and ignore size
        wchar_t buf [32];
        if (0 == dst) {
            dst = buf;
            size = sizeof buf / sizeof *buf;
        }

        const char *from = src;
        const char *from_end = from + std::strlen (from);
        const char *from_next = from;

        wchar_t *to = dst;
        wchar_t *to_end = to + size;
        wchar_t *to_next = to;

        // number of non-NUL wide characters stored in destination buffer
        // (no terminating NUL is written)
        std::size_t nconv = 0;

        // the local buffer is reused for every chunk, so only the source
        // position decides when the loop ends
        for ( ; from_next != from_end; ) {

            const std::codecvt_base::result res =
                cvt.in (*pstate,
                        from, from_end, from_next,
                        to, to_end, to_next);

            switch (res) {

            case std::codecvt_base::error:
                return std::size_t (-1);

            case std::codecvt_base::noconv:
                // should not happen (bad codecvt facet)
                return std::size_t (-1);

            case std::codecvt_base::ok:
            case std::codecvt_base::partial:

                nconv += to_next - to;

                // done when no input was consumed or when converting
                // into the caller's buffer
                if (from_next == from || dst != buf)
                    return nconv;

                // counting only: continue after the consumed input,
                // overwriting the local buffer
                from = from_next;
                to = dst;
                to_end = dst + size;

                break;
            }
        }

        return nconv;
    }
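
    The reverse direction, left as an exercise above, might look roughly
    like the sketch below. This is not the stdcxx code: narrow_with is a
    hypothetical helper name, and sizing the output buffer as the number of
    wide characters times codecvt::max_length() is an assumption that
    should hold for stateless encodings but ignores any shift sequences a
    stateful encoding may need (no unshift() call is made).

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert a wide string to a narrow string using the codecvt facet of the
// given locale. (narrow_with is a hypothetical helper, not a library call.)
std::string narrow_with(const std::locale& loc, const std::wstring& in)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(loc);

    if (in.empty())
        return std::string();

    // max_length() is the largest number of chars that correspond to a
    // single wide character, so this is an upper bound for the output
    // (assumption: stateless encoding, no shift sequences).
    std::string out(in.size() * cvt.max_length(), '\0');

    std::mbstate_t state = std::mbstate_t();  // initial shift state
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result res =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);

    // Treat anything but a complete conversion as a failure in this sketch.
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed");

    out.resize(to_next - &out[0]);            // trim to the bytes written
    return out;
}
```

    Allocating the upper bound and trimming afterwards avoids the need for
    a separate counting pass, at the cost of some temporary memory.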

    [...]
    > BTW, am I correct in above code? I mean at the second time call for
    > wcstombs() or mbstowcs() which use "len" as the length rather than as
    > the first call which are use "wstr.length()" or "str.length()" as the
    > length?


    I don't think that's correct. The last argument specifies the size of
    the destination buffer.
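
    One way to write the two-call pattern with that in mind is sketched
    below. The helper name to_wide is hypothetical. Calling mbstowcs() with
    a null destination to obtain the length is documented by POSIX and by
    the Microsoft CRT; as far as I know ISO C only spells this out for
    mbsrtowcs(), so treat it as an assumption about the platform.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Convert a multibyte string to a wide string with std::mbstowcs(), using
// the encoding of the current C locale (LC_CTYPE).
// (to_wide is a hypothetical helper name.)
std::wstring to_wide(const std::string& str)
{
    // First call: a null destination only asks for the length, which does
    // not include the terminating NUL (POSIX / Microsoft CRT behavior).
    const std::size_t len = std::mbstowcs(0, str.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");

    // Second call: the last argument is the capacity of the destination
    // in wide characters, so len + 1 leaves room for the terminating NUL.
    std::vector<wchar_t> buf(len + 1);
    std::mbstowcs(&buf[0], str.c_str(), buf.size());

    return std::wstring(&buf[0], len);
}
```

    Note that str.c_str() is passed rather than the std::string object
    itself, since mbstowcs() takes a const char*.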

    >

    [...]
    > Thank you, I know how to handle this in Apache C++ Standard Library
    > now. I will try that.
    > Do you know the how can I use g++'s STL do this? I mean, conversion
    > between wchar_t*, which contain UCS-4 string, and char*, which contain
    > UCS-2 or UTF16 string.


    You should be able to use the same code to convert between UCS and
    UTF-8 across all implementations. The only thing that may be different
    is the name of the locale. I don't know of a portable way to do UTF-16
    (not to be confused with UCS-2), or UCS-2 on platforms where wchar_t
    isn't 2 bytes wide (or, conversely, UCS-4 where wchar_t is 2 bytes).
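
    Where no suitable facet is available, a hand-rolled conversion is an
    option for the UCS-2LE file case discussed in this thread. The sketch
    below is an illustration, not part of any library: decode_ucs2le and
    encode_ucs2le are hypothetical names, it assumes BMP-only text with no
    surrogates and no byte order mark, and it relies on wchar_t being at
    least 16 bits wide.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UCS-2LE bytes (two bytes per character, low byte first) into a
// wide string. The byte order is handled explicitly, so the result does
// not depend on sizeof(wchar_t) or on any locale.
std::wstring decode_ucs2le(const std::string& bytes)
{
    std::wstring out;
    // A trailing odd byte, if any, is ignored in this sketch.
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned long lo = static_cast<unsigned char>(bytes[i]);
        const unsigned long hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<wchar_t>((hi << 8) | lo);
    }
    return out;
}

// Encode a wide string as UCS-2LE bytes. Characters outside the BMP cannot
// be represented in UCS-2 and are rejected.
std::string encode_ucs2le(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(text[i]);
        if (cp > 0xFFFF)
            throw std::range_error("character outside the BMP");
        out += static_cast<char>(cp & 0xFF);  // low byte first
        out += static_cast<char>(cp >> 8);    // then high byte
    }
    return out;
}
```

    The byte strings can be read from and written to a file opened in
    binary mode, which keeps the file format identical on Windows and
    Linux.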
     
    , May 4, 2007
    #10
  11. Dancefire

    Guest

    On May 4, 11:22 am, "" <> wrote:
    [...]
    > Here's an implementation of mbstowcs() using codecvt. I'll probably
    > put it up on the Apache stdcxx site or include it in the documentation
    > but I'm pasting it here for reference (let me know if you run into any
    > problems with it).


    I corrected a number of bugs in the code, added a few comments,
    and checked the example into our repository:
    http://svn.apache.org/repos/asf/incubator/stdcxx/trunk/examples/manual/mbsrtowcs.cpp
     
    , May 5, 2007
    #11
