given char* utf8, how to read unicode line by line, and output utf8

Discussion in 'C++' started by gry, Mar 13, 2012.

  1. gry

    gry Guest

    [linux only, g++ 4.3.2]
    I need to write a function that takes a char* containing UTF-8 data
    and, 'line' by 'line', edits the corresponding Unicode text, dumping
    the UTF-8 of the edited result into another char*. The signature
    would be something like:
    void doit(char *inbytes, char *outbytes){};

    The 'editing' amounts to omitting some lines and maybe changing some
    unicode characters to other characters.
    The unicode text may include 1, 2, 3, and 4 byte characters.

    My initial try looks like:
    void doit(char *inbytes, char *outbytes){
        wstringstream ins, outs;
        ins.imbue(locale("en_US.UTF8"));
        ins << inbytes;
        wstring line;
        while(line = ins.getline()) { //omit lines containing "X"
            if(line.find(L"X")) {
                outs << line << endl;
            }
        }
        strcpy(outbytes, outs.rdbuf().c_str());
    };
    };
    This gets a compiler error that:
    no matching function for call to ‘std::basic_stringstream<wchar_t,
    std::char_traits<wchar_t>, std::allocator<wchar_t> >::getline()’
    If there's a clean solution to that, and that's the only significant
    problem, that would be great.
    Since this is my first C++ code dealing with Unicode, I fear there may be
    more serious problems.
    Please address the intention outlined above, more than my particular
    code, assuming there is any hope at all...
    I hope dearly to solve this cleanly without piling on extra libraries,
    but I'll do what I must.
     
    gry, Mar 13, 2012
    #1

  2. Re: given char* utf8, how to read unicode line by line, and output utf8

    I don't know about the locale thing. Locale names are generally not
    defined by the C and C++ standards (only "C" and "" are well-defined),
    and the whole thing sucks in general, IMHO. But instead of

    line = ins.getline()

    write

    getline( ins, line )

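    Outputting a `char*` to a wide stream, as you do before that, compiles,
    but it widens each byte separately instead of decoding UTF-8, so
    anything beyond plain ASCII will not survive the trip.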
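
    A minimal sketch of the `getline( ins, line )` loop, assuming the text
    has already been decoded into a `wstring`. The name
    `dropLinesContaining` is hypothetical and only for illustration. Note
    that the test compares against `npos`: `find` returns a position (0 for
    a match at the start of the line) and `npos` when nothing is found, so
    using its result directly as a condition does not test for containment.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: copy `text` line by line, omitting every line
// that contains `needle`.
std::wstring dropLinesContaining(std::wstring const& text,
                                 std::wstring const& needle)
{
    std::wistringstream ins(text);
    std::wostringstream outs;
    std::wstring line;

    // The free function std::getline returns the stream, which converts
    // to false once no further line can be read.
    while (std::getline(ins, line)) {
        if (line.find(needle) == std::wstring::npos) {
            outs << line << L'\n';   // every kept line gets a '\n' terminator
        }
    }
    return outs.str();
}
```

    Calling `dropLinesContaining(text, L"X")` would keep only the lines
    without an "X". A final line that lacks a trailing newline comes back
    with one added.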

    However, the strcpy at the end is a mixture of char and wchar_t and
    ungood stuff. At that point, if you do things like you're doing them,
    you have to convert back from wchar_t to sequences of UTF-8 char bytes.

    Anyway, in C++ the signature should be more like

    string doit( string const& inbytes )

    I think a more reasonable approach than the streams, is to define two
    helper functions

    wstring utf32From( string const& utf8Data )

    and

    string utf16From( wstring const& utf32Data )

    and

    static_assert( sizeof( wchar_t ) == 4, "Hey, bad wchar_t size!" );
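
    One way the two helpers could look if written by hand, as a sketch
    only. It assumes a C++11 compiler (for `static_assert` and the
    range-based `for`) and a 4-byte `wchar_t`, and it uses the corrected
    name `utf8From` for the second helper. The decoder checks lead and
    continuation bytes but does not reject overlong encodings or surrogate
    code points, which a production version probably should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

static_assert(sizeof(wchar_t) == 4, "Hey, bad wchar_t size!");

// Decode UTF-8 bytes into one code point per wchar_t.
std::wstring utf32From(std::string const& utf8Data)
{
    std::wstring result;
    std::size_t i = 0;
    while (i < utf8Data.size()) {
        unsigned char const lead = static_cast<unsigned char>(utf8Data[i]);
        unsigned long codePoint = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)                { codePoint = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { codePoint = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { codePoint = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { codePoint = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > utf8Data.size() - i - 1) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char const c = static_cast<unsigned char>(utf8Data[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            codePoint = (codePoint << 6) | (c & 0x3F);
        }
        result.push_back(static_cast<wchar_t>(codePoint));
        i += extra + 1;
    }
    return result;
}

// Encode each code point as 1 to 4 UTF-8 bytes.
std::string utf8From(std::wstring const& utf32Data)
{
    std::string result;
    for (wchar_t const wc : utf32Data) {
        unsigned long const cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::runtime_error("code point out of Unicode range");
        }
    }
    return result;
}
```

    With these, the editing can be done on the `wstring` in between, where
    every element is a whole character regardless of how many bytes it
    took in UTF-8.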

    The C++11 standard defines functions that can do the two conversions
    above pretty directly; I'll leave it to you to read the documentation.
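
    For illustration, here is a sketch of how that could look with
    `std::wstring_convert` and `std::codecvt_utf8` from C++11, put together
    with the `getline` loop into the `string doit( string const& )` shape
    suggested above. Some assumptions to be aware of: it needs a standard
    library that actually ships `<codecvt>` (g++ 4.3.2 does not), these
    facilities were deprecated in C++17 so newer compilers may warn, the
    helper name `utf8From` is the corrected spelling, and `wchar_t` is
    taken to be 4 bytes as on Linux.

```cpp
#include <codecvt>
#include <locale>
#include <sstream>
#include <string>

static_assert(sizeof(wchar_t) == 4, "Hey, bad wchar_t size!");

// UTF-8 bytes -> one code point per wchar_t.
std::wstring utf32From(std::string const& utf8Data)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8Data);   // throws std::range_error on bad input
}

// One code point per wchar_t -> UTF-8 bytes.
std::string utf8From(std::wstring const& utf32Data)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(utf32Data);
}

// Decode, drop every line containing "X", and encode the rest again.
std::string doit(std::string const& inbytes)
{
    std::wistringstream ins(utf32From(inbytes));
    std::wstring kept;
    std::wstring line;
    while (std::getline(ins, line)) {
        if (line.find(L"X") == std::wstring::npos) {
            kept += line;
            kept += L'\n';
        }
    }
    return utf8From(kept);
}
```

    Changing individual characters would go inside the loop, on `line`,
    before it is appended to `kept`.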


    Cheers & hth.,

    - Alf
     
    Alf P. Steinbach, Mar 13, 2012
    #2

  3. Re: given char* utf8, how to read unicode line by line, and output utf8

    On 13.03.2012 05:29, Alf P. Steinbach wrote:
    > string utf16From( wstring const& utf32Data )


    Sorry, typo: that should be `utf8From`, since it returns UTF-8.
     
    Alf P. Steinbach, Mar 13, 2012
    #3
