Wide characters and streams

Discussion in 'C++' started by Kirit Sælensminde, Sep 30, 2006.

  1. From thread
    http://groups.google.com/group/comp.lang.c /browse_thread/thread/79d767efa42df516

    "P.J. Plauger" <> writes:
    > In practice they're not broken and you can write Unicode characters.
    > As with any other Standard C++ library, you need an appropriate
    > codecvt facet for the code conversion you favor. See our add-on
    > library, which includes a broad assortment.


    I'll take this at face value and I'll have to suppose that I don't
    understand what the streams should do.

    I guess then the root of my problem is my expectation that if I use a
    std::ofstream it will write a char sequence to disk and if I use a
    std::wofstream it will write a wchar_t sequence to disk. I presume then
    that this is wrong?

    I also have to assume that if I write a UTF-16 sequence to std::wcout
    then I should not expect it to display correctly on a platform that
    uses UTF-16?

    The code below summarises my expectation of what I would be able to do,
    so I guess my understanding is off. What should the code below do?

    #include "stdafx.h" // This header is empty
    #include <iostream>
    #include <conio.h>
    #include <fstream>

    int wmain(int /*argc*/, wchar_t* /*argv*/[])
    {
        std::wcout << L"Hello world!" << std::endl;
        // Surname with AE ligature
        std::wcout << L"Hello Kirit S\x00e6lensminde" << std::endl;
        // Kirit transliterated (probably badly) into Greek
        std::wcout << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
        // Kirit transliterated into Thai
        std::wcout << L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17" << std::endl;

        //if ( std::wcout )
        //    std::cout << "\nstd::wcout still good" << std::endl;
        //else
        //    std::cout << "\nstd::wcout gone bad" << std::endl;

        _cputws( L"\n\n\n" );
        _cputws( L"Hello Kirit S\x00e6lensminde\n" ); // AE ligature
        _cputws( L"Hello \x039a\x03b9\x03c1\x03b9\x03c4\n" ); // Greek
        _cputws( L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17\n" ); // Thai

        std::wofstream wout1( "test1.txt" );
        wout1 << L"12345" << std::endl;

        //if ( wout1 )
        //    std::cout << "\nwout1 still good" << std::endl;
        //else
        //    std::cout << "\nwout1 gone bad" << std::endl;

        std::wofstream wout2( "test2.txt" );
        wout2 << L"Hello world!" << std::endl;
        wout2 << L"Hello Kirit S\x00e6lensminde" << std::endl;
        wout2 << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
        wout2 << L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17" << std::endl;

        //if ( wout2 )
        //    std::cout << "\nwout2 still good" << std::endl;
        //else
        //    std::cout << "\nwout2 gone bad" << std::endl;

        return 0;
    }


    I've compiled this on MSVC Studio 2003 and it reports the following
    command line switches on a debug build (i.e. Unicode defined as the
    character set and wchar_t as a built-in type):

    /Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm
    /EHsc /RTC1 /MLd /Zc:wchar_t /Zc:forScope /Yu"stdafx.h"
    /Fp"Debug/wcout.pch" /Fo"Debug/" /Fd"Debug/vc70.pdb" /W3 /nologo /c
    /Wp64 /ZI /TP

    If I run this directly from the IDE then it clearly does some odd
    narrowing of the output as the Greek cputws() line displays:

    Hello ????t

    Which to me looks like a failure in the character substitution from
    Unicode to what I presume is some OEM encoding. Now don't get me
    wrong: I think this is a poor default for running something on a
    Unicode platform (this is on Windows 2003 Server), but it does seem
    to be beside the point for this discussion.

    If I run it from a command prompt with Unicode I/O turned on (cmd.exe
    /u) then the output is somewhat more encouraging, but not a lot:

    Hello world!
    Hello Kirit Sµlensminde
    Hello


    Hello Kirit Sælensminde
    Hello Κιριτ
    Hello คีริท

    The _cputws calls all work as I would expect, but std::wcout doesn't
    work at all. Worse, uncommenting the stream tests shows that there is
    an error on std::wcout, rendering it unusable from then on. Note also
    that it has translated the AE ligature into what looks to me like a
    Greek lower-case mu. The Greek capital kappa has wedged the stream.

    The two txt files are interesting. test1.txt is seven bytes long,
    exactly half the size I would naively expect, and test2.txt is 45
    bytes long, exactly the length I'd expect from a char stream that
    only went up to, but didn't include, the Greek capital kappa.

    Now, if this is all by design then I presume that there is something
    fairly simple that I can do to have all of this work in the way that I
    naively expect, or does the C++ standard in some way mandate that it is
    going to be really hard? Maybe it's a quality of implementation issue
    and we just have to buy the library upgrade or write our own codecvt
    implementations?

    What we've done is to use our own implementation of a UTF-16 to UTF-8
    converter (that we know works properly as it drives our web interfaces)
    and just send that sequence to a std::ofstream. We've had to more or
    less give up on meaningful and pipeable console output.


    K
    Kirit Sælensminde, Sep 30, 2006
    #1

  2. P.J. Plauger (Guest)

    "Kirit Sælensminde" <> wrote in message
    news:...

    From thread

    http://groups.google.com/group/comp.lang.c /browse_thread/thread/79d767efa42df516

    "P.J. Plauger" <> writes:
    > In practice they're not broken and you can write Unicode characters.
    > As with any other Standard C++ library, you need an appropriate
    > codecvt facet for the code conversion you favor. See our add-on
    > library, which includes a broad assortment.


    I'll take this at face value and I'll have to suppose that I don't
    understand what the streams should do.

    I guess then the root of my problem is my expectation that if I use a
    std::ofstream it will write a char sequence to disk and if I use a
    std::wofstream it will write a wchar_t sequence to disk. I presume then
    that this is wrong?

    [pjp] It's not exactly right. When you write to a wofstream, the
    wchar_t sequence you write gets converted to a byte sequence written
    to the file. How that conversion occurs depends on the codecvt facet
    you choose. Choose none and you get some default. In the case of VC++
    the default is pretty stupid -- the first 256 codes get written as
    single bytes and all other wide-character codes fail to write.

    I also have to assume that if I write a UTF-16 sequence to std::wcout
    then I should not expect it to display correctly on a platform that
    uses UTF-16?

    [pjp] Again, that depends on the codecvt facet you use. With our add-on
    library (available at our web site) we offer a host of codecvt facets.
    One of them converts UTF-16 wide characters to UTF-8 files. Another
    writes UTF-16 to UTF-16 files, with choice of endianness and an
    optional BOM that tells what kind of file it is.

    The code below summarises my expectation of what I would be able to do,
    so I guess my understanding is off. What should the code below do?

    [pjp] <Lengthy code omitted, which reaffirms the above.>

    Now, if this is all by design then I presume that there is something
    fairly simple that I can do to have all of this work in the way that I
    naively expect, or does the C++ standard in some way mandate that it is
    going to be really hard? Myabe it's a quality of implementation issue
    and we just have to buy the library upgrade or write our own codecvt
    implementations?

    [pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
    behavior sensible -- for your needs. I suspect that you're in the majority
    these days, which is why we've made this the default for our Standard
    C library. But the Standard C++ library was designed to be way more
    flexible. Hence, it is in effect mandated to be hard, and it is indeed a
    QOI issue what to provide. But writing your own codecvt facets is way
    harder than it appears, so be careful.

    What we've done is to use our own implementation of a UTF-16 to UTF-8
    converter (that we know works properly as it drives our web interfaces)
    and just send that sequence to a std::ofstream. We've had to more or
    less give up on meaningful and pipeable console output.

    [pjp] That's one way out, yes.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
    P.J. Plauger, Oct 2, 2006
    #2

  3. P.J. Plauger wrote:
    > "Kirit Sælensminde" <> wrote in message
    > news:...
    >
    > From thread

    > http://groups.google.com/group/comp.lang.c /browse_thread/thread/79d767efa42df516
    >
    > "P.J. Plauger" <> writes:
    > > In practice they're not broken and you can write Unicode characters.
    > > As with any other Standard C++ library, you need an appropriate
    > > codecvt facet for the code conversion you favor. See our add-on
    > > library, which includes a broad assortment.

    >
    > I'll take this at face value and I'll have to suppose that I don't
    > understand what the streams should do.
    >
    > I guess then the root of my problem is my expectation that if I use a
    > std::ofstream it will write a char sequence to disk and if I use a
    > std::wofstream it will write a wchar_t sequence to disk. I presume then
    > that this is wrong?
    >
    > [pjp] It's not exactly right. When you write to a wofstream, the
    > wchar_t sequence you write gets converted to a byte sequence written
    > to the file. How that conversion occurs depends on the codecvt facet
    > you choose. Choose none and you get some default. In the case of VC++
    > the default is pretty stupid -- the first 256 codes get written as
    > single bytes and all other wide-character codes fail to write.


    Indeed that is pretty stupid. I don't mind stupid defaults so long as
    they are described in the documentation, but the documentation of
    std::wofstream or std::wcout makes no mention of this. I notice though
    that std::wstringstream doesn't seem to suffer this problem.

    As far as std::wcout goes, though, there must be something else going
    on as well, or the AE ligature would not have been mangled to a Greek
    mu. This would seem to imply that using a codecvt that passed through
    UTF-16 would not work. Or is it the existing codecvt that is
    performing the mis-transliteration?

    I can't help but think that a lot of the frustration could be very
    simply resolved by just properly documenting what the libraries do and
    putting that documentation where people will see it.

    >
    > I also have to assume that if I write a UTF-16 sequence to std::wcout
    > then I should not expect it to display correctly on a platform that
    > uses UTF-16?
    >
    > [pjp] Again, that depends on the codecvt facet you use. With our add-on
    > library (available at our web site) we offer a host of codecvt facets.
    > One of them converts UTF-16 wide characters to UTF-8 files. Another
    > writes UTF-16 to UTF-16 files, with choice of endianness and an
    > optional BOM that tells what kind of file it is.



    As a practical matter I don't understand how wchar_t streams can be
    seen as anything but broken (in the 'not working' sense) on this
    platform if I have to write my own codecvt implementation or buy one in
    so that I can write UTF-16 files.

    It seems bizarre that an assertion that the streams aren't broken is
    compatible with the fact that they cannot be used in what must be a
    very common (if not the most common) use case. An inability to write
    UTF-16 to the console sure seems broken to me and an implementation
    that writes UTF-16 streams as you describe surely can't be described as
    'working' for any practical purpose.

    >
    > The code below summarises my expectation of what I would be able to do,
    > so I guess my understanding is off. What should the code below do?
    >
    > [pjp] <Lengthy code omitted, which reaffirms the above.>
    >
    > Now, if this is all by design then I presume that there is something
    > fairly simple that I can do to have all of this work in the way that I
    > naively expect, or does the C++ standard in some way mandate that it is
    > going to be really hard? Maybe it's a quality of implementation issue
    > and we just have to buy the library upgrade or write our own codecvt
    > implementations?
    >
    > [pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
    > behavior sensible -- for your needs. I suspect that you're in the majority
    > these days, which is why we've made this the default for our Standard
    > C library. But the Standard C++ library was designed to be way more
    > flexible. Hence, it is in effect mandated to be hard, and it is indeed a
    > QOI issue what to provide. But writing your own codecvt facets is way
    > harder than it appears, so be careful.


    Actually if the default codecvt were simply a null, do-nothing UTF-16
    to UTF-16 pass-through, that would be fine too.

    We did notice that writing a codecvt implementation is no trivial task.
    We tried to write a UTF-16 to UTF-8 codecvt, but haven't managed to get
    it to work.

    Looking at the comments in our source it seems that there was some
    confusion about what do_length should return. I think the standard says
    it should return the number of bytes, but the documentation we were
    using at the time seemed to imply that it should return the number of
    wchar_t. The documentation we're now using looks to have been changed,
    but I'm not sure I can work out from the wording what it is saying
    should be returned.

    This is something that we may revisit.


    On your web site, is "compleat" some joke that I'm not getting?

    And thanks for taking the time to answer. It's certainly cleared up a
    lot about what is going on.


    K
    Kirit Sælensminde, Oct 3, 2006
    #3
  4. P.J. Plauger (Guest)

    "Kirit Sælensminde" <> wrote in message
    news:...

    P.J. Plauger wrote:
    > "Kirit Sælensminde" <> wrote in message
    > news:...
    >
    > From thread

    > http://groups.google.com/group/comp.lang.c /browse_thread/thread/79d767efa42df516
    >
    > "P.J. Plauger" <> writes:
    > > In practice they're not broken and you can write Unicode characters.
    > > As with any other Standard C++ library, you need an appropriate
    > > codecvt facet for the code conversion you favor. See our add-on
    > > library, which includes a broad assortment.

    >
    > I'll take this at face value and I'll have to suppose that I don't
    > understand what the streams should do.
    >
    > I guess then the root of my problem is my expectation that if I use a
    > std::ofstream it will write a char sequence to disk and if I use a
    > std::wofstream it will write a wchar_t sequence to disk. I presume then
    > that this is wrong?
    >
    > [pjp] It's not exactly right. When you write to a wofstream, the
    > wchar_t sequence you write gets converted to a byte sequence written
    > to the file. How that conversion occurs depends on the codecvt facet
    > you choose. Choose none and you get some default. In the case of VC++
    > the default is pretty stupid -- the first 256 codes get written as
    > single bytes and all other wide-character codes fail to write.


    Indeed that is pretty stupid. I don't mind stupid defaults so long as
    they are described in the documentation, but the documentation of
    std::wofstream or std::wcout makes no mention of this. I notice though
    that std::wstringstream doesn't seem to suffer this problem.

    As far as std::wcout goes, though, there must be something else going
    on as well, or the AE ligature would not have been mangled to a Greek
    mu. This would seem to imply that using a codecvt that passed through
    UTF-16 would not work. Or is it the existing codecvt that is
    performing the mis-transliteration?

    [pjp] The whole problem is the stupid default conversion. Our C++
    library has always used the fgetwc/fputwc machinery from the C
    library for the default wchar_t codecvt facet. Thus, we more or less
    inherit whatever decision a compiler vendor has chosen for C.
    (Unless, of course, that vendor has also licensed our C library,
    in which case you get UTF-16/UTF-8 by default.)

    But remember that what you see is also determined by the display
    software, which is outside the purview of C and C++. Sometimes
    that's not what you expect, so extended character sets get curdled
    in surprising ways on their way to your eyeballs.
    ---

    I can't help but think that a lot of the frustration could be very
    simply resolved by just properly documenting what the libraries do and
    putting that documentation where people will see it.

    [pjp] I agree that these decisions could be better highlighted.
    ---

    > I also have to assume that if I write a UTF-16 sequence to std::wcout
    > then I should not expect it to display correctly on a platform that
    > uses UTF-16?
    >
    > [pjp] Again, that depends on the codecvt facet you use. With our add-on
    > library (available at our web site) we offer a host of codecvt facets.
    > One of them converts UTF-16 wide characters to UTF-8 files. Another
    > writes UTF-16 to UTF-16 files, with choice of endianness and an
    > optional BOM that tells what kind of file it is.


    As a practical matter I don't understand how wchar_t streams can be
    seen as anything but broken (in the 'not working' sense) on this
    platform if I have to write my own codecvt implementation or buy one in
    so that I can write UTF-16 files.

    [pjp] If they don't do what you want, then they are broken to you.
    ---

    It seems bizarre that an assertion that the streams aren't broken is
    compatible with the fact that they cannot be used in what must be a
    very common (if not the most common) use case. An inability to write
    UTF-16 to the console sure seems broken to me and an implementation
    that writes UTF-16 streams as you describe surely can't be described as
    'working' for any practical purpose.

    [pjp] The common use case of today is not the one that was common
    a decade or more ago, when some of these decisions were made. The
    default conversion is doubtless overdue for revision.
    ---

    > The code below summarises my expectation of what I would be able to do,
    > so I guess my understanding is off. What should the code below do?
    >
    > [pjp] <Lengthy code omitted, which reaffirms the above.>
    >
    > Now, if this is all by design then I presume that there is something
    > fairly simple that I can do to have all of this work in the way that I
    > naively expect, or does the C++ standard in some way mandate that it is
    > going to be really hard? Maybe it's a quality of implementation issue
    > and we just have to buy the library upgrade or write our own codecvt
    > implementations?
    >
    > [pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
    > behavior sensible -- for your needs. I suspect that you're in the majority
    > these days, which is why we've made this the default for our Standard
    > C library. But the Standard C++ library was designed to be way more
    > flexible. Hence, it is in effect mandated to be hard, and it is indeed a
    > QOI issue what to provide. But writing your own codecvt facets is way
    > harder than it appears, so be careful.


    Actually if the default codecvt were simply a null, do-nothing UTF-16
    to UTF-16 pass-through, that would be fine too.

    [pjp] For some people.
    ---

    We did notice that writing a codecvt implementation is no trivial task.
    We tried to write a UTF-16 to UTF-8 codecvt, but haven't managed to get
    it to work.

    [pjp] It's the hardest codecvt facet of all to write. In fact, it's
    officially impossible, since codecvt was "designed" to do 1-N code
    conversions, and UTF-16/UTF-8 is M-N. No Standard C++ library except
    ours will even give you a fighting chance, and it's a fiendishly
    difficult coding problem even then.
    ---

    Looking at the comments in our source it seems that there was some
    confusion about what do_length should return. I think the standard says
    it should return the number of bytes, but the documentation we were
    using at the time seemed to imply that it should return the number of
    wchar_t. The documentation we're now using looks to have been changed,
    but I'm not sure I can work out from the wording what it is saying
    should be returned.

    [pjp] The description of codecvt in the C++ Standard is murky, to
    put it politely.
    ---

    This is something that we may revisit.

    On your web site, is "compleat" some joke that I'm not getting?

    [pjp] "Compleat" is an older spelling of "complete". See, for
    example, the noted 17th century book, "The Compleat Angler or
    the Contemplative man's Recreation."
    ---

    And thanks for taking the time to answer. It's certainly cleared up a
    lot about what is going on.

    [pjp] Welcome.
    ---

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
    P.J. Plauger, Oct 3, 2006
    #4
