given char* utf8, how to read unicode line by line, and output utf8

gry · Mar 13, 2012

[linux only, g++ 4.3.2]
I need to write a function that takes a char* containing UTF-8 data and,
'line' by 'line', edits the corresponding Unicode text, dumping the UTF-8
of the edited result into another char*. The signature would be something like:

    void doit(char *inbytes, char *outbytes){};

The 'editing' amounts to omitting some lines and maybe changing some
Unicode characters to other characters.
The Unicode text may include 1-, 2-, 3-, and 4-byte characters.

My initial try looks like:

    void doit(char *inbytes, char *outbytes){
        wstringstream ins, outs;
        ins.imbue(locale("en_US.UTF8"));
        ins << inbytes;
        wstring line;
        while(line = ins.getline()) { //omit lines containing "X"
            if(line.find(L"X")) {
                outs << line << endl;
            }
        }
        strcpy(outbytes, outs.rdbuf().c_str());
    };
    };

This gets a compiler error that:

    no matching function for call to ‘std::basic_stringstream<wchar_t,
    std::char_traits<wchar_t>, std::allocator<wchar_t> >::getline()’

If there's a clean solution to that, and that's the only significant
problem, that would be great.
Since this is my first C++ dealing with Unicode, I fear there may be
more serious problems.
Please address the intention outlined above, more than my particular
code, assuming there is any hope at all...
I dearly hope to solve this cleanly without piling on extra libraries,
but I'll do what I must.

Alf P. Steinbach · Mar 13, 2012

I don't know about the locale thing. Locale names are generally not
defined by the C and C++ standards (only "" and "C" are well-defined), and
the whole thing sucks in general, IMHO. But instead of

line = ins.getline()

write

getline( ins, line )

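Outputting a `char*` to a wide stream, as you do before that, compiles,
but as far as I know the stream just widens each `char` separately rather
than decoding UTF-8, so it should only work nicely for plain ASCII input.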
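
To make the `getline( ins, line )` suggestion concrete, here is a minimal sketch of the line loop on already-decoded wide text. The helper name `dropLinesContaining` is made up for illustration. Note the comparison against `npos`: `find` returns a position, not a bool, so the original `if(line.find(L"X"))` would be false only when the match is at index 0.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the thread): copy every line of `text`
// that does NOT contain `marker`, terminating each kept line with '\n'.
std::wstring dropLinesContaining(std::wstring const& text, wchar_t marker)
{
    std::wistringstream ins(text);
    std::wostringstream outs;
    std::wstring line;
    // The free function std::getline fills `line` and returns the stream,
    // which converts to false once no further line can be read.
    while (std::getline(ins, line)) {
        if (line.find(marker) == std::wstring::npos) {
            outs << line << L'\n';
        }
    }
    return outs.str();
}
```

This only covers the filtering step; getting from UTF-8 bytes to the wide string and back is a separate conversion.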

However, the strcpy at the end is a mixture of char and wchar_t and
ungood stuff. At that point, if you do things like you're doing them,
you have to convert back from wchar_t to sequences of UTF-8 char bytes.

Anyway, in C++ the signature should be more like

string doit( string const& inbytes )
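
A sketch of what that signature buys you: the caller no longer has to guess how big the output buffer must be. The body below is an assumption of mine, not the approach described in this post. It filters directly on the UTF-8 bytes, which is enough for the "omit lines containing X" case because bytes below 0x80 (such as '\n' and 'X') never occur inside a multi-byte UTF-8 sequence. Replacing non-ASCII characters would still need a real decode step.

```cpp
#include <sstream>
#include <string>

// Byte-level sketch (assumed, not from the thread): drop every line that
// contains the ASCII character 'X'. Multi-byte UTF-8 sequences are passed
// through untouched because they consist only of bytes >= 0x80.
std::string doit(std::string const& inbytes)
{
    std::istringstream ins(inbytes);
    std::string result;
    std::string line;
    while (std::getline(ins, line)) {
        if (line.find('X') == std::string::npos) {
            result += line;
            result += '\n';
        }
    }
    return result;
}
```

Code that still holds a `char*` can call this as `std::string out = doit(inbytes);` and use `out.c_str()` where a C string is needed.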

I think a more reasonable approach than the streams, is to define two
helper functions

wstring utf32From( string const& utf8Data )

and

string utf16From( wstring const& utf32Data )

and

static_assert( sizeof( wchar_t ) == 4, "Hey, bad wchar_t size!" );

The C++11 standard defines functions that pretty directly can do the two
conversions above for you, I just leave it to you to read the documentation.
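
For a compiler as old as g++ 4.3.2, which predates those C++11 facilities, the two helpers can also be written by hand. The following is a minimal sketch under stated assumptions: `wchar_t` is 32 bits (as with g++ on Linux), the second helper is called `utf8From` (my name for the UTF-32 to UTF-8 direction), and validation is limited to lead bytes, continuation bytes and truncation. It does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into one wchar_t per code point (assumes 32-bit wchar_t).
std::wstring utf32From(std::string const& utf8Data)
{
    std::wstring result;
    std::size_t i = 0;
    while (i < utf8Data.size()) {
        unsigned char const lead = static_cast<unsigned char>(utf8Data[i]);
        unsigned long cp;       // code point being assembled
        std::size_t extra;      // number of continuation bytes expected
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > utf8Data.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char const c = static_cast<unsigned char>(utf8Data[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append the 6 payload bits
        }
        result += static_cast<wchar_t>(cp);
        i += extra + 1;
    }
    return result;
}

// Encode each code point as 1 to 4 UTF-8 bytes.
std::string utf8From(std::wstring const& utf32Data)
{
    std::string result;
    for (std::size_t i = 0; i < utf32Data.size(); ++i) {
        unsigned long const cp = static_cast<unsigned long>(utf32Data[i]);
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::runtime_error("code point out of Unicode range");
        }
    }
    return result;
}
```

With these, the editing can be done on the wide string (one element per character) and the result turned back into bytes with `utf8From( edited )`.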

Cheers & hth.,

- Alf

Alf P. Steinbach · Mar 13, 2012

string utf16From( wstring const& utf32Data )

Sorry, typo. That helper returns UTF-8, so the name should presumably be
utf8From.
