How to use codecvt to convert ASCII <--> UTF-8 within std::ofstream

davihigh

My Friends:

I am using std::ofstream (as well as ifstream). I hope that when I
write some std::string(...) with a locale, the ofstream can convert it to
UTF-8 encoding and save the file to disk, and that ifstream can do the reverse.

From what I have found, I need to set a proper codecvt facet for
this. I need more information, maybe a small piece of sample code. Thank
you!

Regards,
David Xiao
 
davihigh

More information provided here:
My environment is MS VC2003 along with the built-in STL library.

And I am confused by codecvt<>.... It would be much appreciated if someone
could provide a codecvt<> that does "multibyte <--> UTF-8" conversion.

-David Xiao
 
P.J. Plauger

More information provided here:
My environment is MS VC2003 along with the built-in STL library.

And I am confused by codecvt<>.... It would be much appreciated if someone
could provide a codecvt<> that does "multibyte <--> UTF-8" conversion.

You get one with our CoreX package, available at our web site.
You might also find a free one at Boost, if you can afford the
time to locate it and make it work in your environment.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Michiel.Salters

My Friends:

I am using std::ofstream (as well as ifstream). I hope that when I
write some std::string(...) with a locale, the ofstream can convert it to
UTF-8 encoding and save the file to disk, and that ifstream can do the reverse.

Probably no need to, if your Subject: line is correct. UTF-8 is an
8-bit superset of ASCII designed to deal with non-ASCII characters. If you
have ASCII text, each 7-bit ASCII character is represented by a
single char in C++, and that value coincides with its UTF-8 encoding.

Of course, C++ compilers may use EBCDIC internally for char
and/or externally for text files, which breaks this nice model - but in
that case an ASCII -> UTF-8 codecvt obviously won't work either.
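
A minimal sketch of this point: pure ASCII written through a plain char
stream, with no codecvt facet installed, is already a valid UTF-8 file
(file name and text below are arbitrary placeholders):

    #include <fstream>
    #include <string>

    int main() {
        // Every byte below 0x80 means the same thing in ASCII and in UTF-8,
        // so an ordinary ofstream with the default locale already produces
        // a valid UTF-8 file for pure 7-bit ASCII content.
        std::string text = "plain ASCII text\n";
        std::ofstream out("ascii.txt");
        out << text;
        return 0;
    }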

HTH,
Michiel Salters
 
davihigh

I'm sorry for the incorrect wording in the subject line...
I believe Michiel Salters is right: UTF-8 defines its
"characters" as variable-length sequences, and in the 7-bit char case UTF-8
is compatible with ASCII.

Actually I am dealing with Asian languages, for example CP936 (Chinese
GBK) or CP949 (Korean). And I am looking for a way to handle
it:

CONVERT locale(......) <-->UTF-8.

It had better be graceful, working with std::ofstream (I guess that will
need a codecvt<>). Of course, the ugly way is to use some code to convert
the whole file.

Thanks, P.J. Plauger, for the suggestion. I found one codecvt<> in Boost,
but it seems to work on UTF-8 <--> UTF-16.
Anyway, I am following this thread with attention...

Regards, David Xiao
 
Dietmar Kuehl

I am using std::ofstream (as well as ifstream). I hope that when I
write some std::string(...) with a locale, the ofstream can convert it to
UTF-8 encoding and save the file to disk, and that ifstream can do the reverse.

'std::ofstream' and 'std::ifstream' operate on 'char' objects. Assuming
an appropriate configuration is set up, these can indeed be converted
to UTF-8, but this is hardly exciting: they would map to a
total of at most 256 characters.

It is important to understand that, at least conceptually, the standard
C++ library internally operates on characters where each character is
represented by one character object. Internally, no multi-byte
representation is supported. To cope with more than 256 characters,
you would use a different character type, e.g. 'wchar_t'. Effectively,
the idea was to use 'wchar_t' objects to represent Unicode characters,
which at the time the standard C++ library was designed were 16-bit
units, each unit representing an individual character.
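
A tiny illustration of "one character object per character" versus a
multi-byte external encoding: the byte values below are the UTF-8 encoding
of U+4E2D (中), and the assertions only spell out the sizes.

    #include <cassert>
    #include <string>

    int main() {
        // Internally: one character object per character. U+4E2D is a single
        // wchar_t element here (it lies in the BMP, so this holds for 16-bit
        // and 32-bit wchar_t alike).
        std::wstring wide = L"\u4e2d";
        assert(wide.size() == 1);

        // Externally: the same character encoded as UTF-8 occupies three
        // bytes, i.e. three char elements of a narrow string.
        std::string utf8 = "\xe4\xb8\xad";
        assert(utf8.size() == 3);
        return 0;
    }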

Unfortunately, the Unicode people decided at some point that it would
be a really brilliant idea to throw all their fundamental assumptions
overboard and have combining characters (i.e. suddenly some characters
were represented by more than one unit) and 20-bit characters. This
does not mix too well with C++, though: some compiler vendors had
decided that their 'wchar_t' would be 16 bits wide, and they are
essentially bound to this choice due to existing binary interfaces
already using 16-bit 'wchar_t's. To cope with this, the standard
library typically supports an internal UTF-16 representation, although
most code actually uses 'wchar_t's as UCS-2 entities, i.e. it cares
neither about UTF-16 nor about combining characters.

Although the UTF-16 support somewhat muddies the water, in the context
of the standard C++ library it is best to think in terms of
"characters", i.e. the entities used within a program, which are stored
e.g. in 'std::basic_string<cT>' (each 'cT' representing one character),
and their "encoding", i.e. the entities ending up as bytes in a file.

You seem to have some internal multi-byte encoding which you
apparently want to write to some other multi-byte encoding with the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations, but I would use the following approach anyway
(see the sketch after this list):
- Convert your multi-byte encoded text into a sequence of characters
using the normal internal representation, probably using an
appropriate code conversion facet.
- Use the characters internally in your internal processing, probably
taking care neither to rip combined characters nor UTF-16 characters
apart.
- Have a suitable code conversion facet convert the internal
representation into whatever suitable encoding you want to use, e.g.
UTF-16.
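
A rough sketch of those three steps, not with a codecvt facet but with the
Win32 conversion functions available in the poster's VC2003 environment
(the helper names are made up and error handling is omitted): CP936 bytes
are decoded to wide characters, processed, and re-encoded as UTF-8.

    #include <windows.h>
    #include <fstream>
    #include <string>
    #include <vector>

    // Step 1: decode the multi-byte input (e.g. CP936/GBK) into wide characters.
    std::wstring decode(const std::string& bytes, UINT codePage) {
        if (bytes.empty()) return std::wstring();
        int n = MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), 0, 0);
        std::vector<wchar_t> buf(n);
        MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), &buf[0], n);
        return std::wstring(buf.begin(), buf.end());
    }

    // Step 3: encode the wide characters as UTF-8 bytes.
    std::string encode_utf8(const std::wstring& wide) {
        if (wide.empty()) return std::string();
        int n = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), 0, 0, 0, 0);
        std::vector<char> buf(n);
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), &buf[0], n, 0, 0);
        return std::string(buf.begin(), buf.end());
    }

    int main() {
        std::string gbk_bytes = "...";              // text in CP936, e.g. read from a file
        std::wstring wide = decode(gbk_bytes, 936); // step 1

        // Step 2: work on `wide` here, one wchar_t object per (BMP) character.

        std::string utf8 = encode_utf8(wide);       // step 3
        std::ofstream out("out.txt", std::ios::binary);
        out.write(utf8.data(), (std::streamsize)utf8.size());
        return 0;
    }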

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings. I think there are also free alternatives but I
don't know any of them off-hand although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
and converts between this character representation (although, from
a purist view this is not a suitable character representation at all)
and the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).
 
P.J. Plauger

I'm sorry for the incorrect wording in the subject line...
I believe Michiel Salters is right: UTF-8 defines its
"characters" as variable-length sequences, and in the 7-bit char case UTF-8
is compatible with ASCII.

Actually I am dealing with Asian languages, for example CP936 (Chinese
GBK) or CP949 (Korean). And I am looking for a way to handle
it:

CONVERT locale(......) <-->UTF-8.

It had better be graceful, working with std::ofstream (I guess that will
need a codecvt<>). Of course, the ugly way is to use some code to convert
the whole file.

Our CoreX library has 80-odd codecvt facets, among them are ones
that convert:

-- between CP936 and UCS-2

-- between CP949 and UCS-2

-- between UTF-8 and UCS-2

You can use them with istream/ostream for file I/O or with an
in-memory string-to-string converter that we also supply.
Sounds like exactly what you need.

Thanks, P.J. Plauger, for the suggestion. I found one codecvt<> in Boost,
but it seems to work on UTF-8 <--> UTF-16.
Anyway, I am following this thread with attention...

Welcome.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
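
For a picture of the mechanism such facets plug into: a specialization of
codecvt<wchar_t, char, mbstate_t> can be imbued into a wide file stream so
that wchar_t characters inside the program come out as UTF-8 bytes in the
file. The sketch below is not CoreX code; it is a hand-rolled, output-only,
BMP-only facet with no error handling, and the class name is made up.

    #include <fstream>
    #include <locale>
    #include <string>

    // Minimal wchar_t (UCS-2) -> UTF-8 output facet. Surrogates, errors and
    // the input direction are ignored; a real library facet handles all of
    // that properly.
    class ucs2_utf8_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
    protected:
        result do_out(std::mbstate_t&,
                      const wchar_t* from, const wchar_t* from_end,
                      const wchar_t*& from_next,
                      char* to, char* to_end, char*& to_next) const {
            from_next = from;
            to_next = to;
            while (from_next != from_end) {
                unsigned long c = static_cast<unsigned long>(*from_next);
                if (c < 0x80) {                        // 1-byte sequence
                    if (to_end - to_next < 1) return partial;
                    *to_next++ = static_cast<char>(c);
                } else if (c < 0x800) {                // 2-byte sequence
                    if (to_end - to_next < 2) return partial;
                    *to_next++ = static_cast<char>(0xC0 | (c >> 6));
                    *to_next++ = static_cast<char>(0x80 | (c & 0x3F));
                } else {                               // 3-byte sequence (BMP)
                    if (to_end - to_next < 3) return partial;
                    *to_next++ = static_cast<char>(0xE0 | (c >> 12));
                    *to_next++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                    *to_next++ = static_cast<char>(0x80 | (c & 0x3F));
                }
                ++from_next;
            }
            return ok;
        }
        bool do_always_noconv() const throw() { return false; }
        int do_encoding() const throw() { return 0; }   // variable-length encoding
        int do_max_length() const throw() { return 3; } // BMP only
        // do_in, do_length, do_unshift are inherited: enough for output use.
    };

    int main() {
        std::wofstream out;
        out.imbue(std::locale(out.getloc(), new ucs2_utf8_codecvt));
        out.open("utf8.txt");
        out << L"\u4e2d\u6587 text\n";   // reaches the disk as UTF-8 bytes
        return 0;
    }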
 
P.J. Plauger

Dietmar Kuehl said:
You seem to have some internal multi-byte encoding which you
apparently want to write to some other multi-byte encoding with the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations

We have a rich set of pairwise transformations that can be chained
together in several useful ways.

but I would use the following approach anyway:
- Convert your multi-byte encoded text into a sequence of characters
using the normal internal representation, probably using an
appropriate code conversion facet.
- Use the characters internally in your internal processing, probably
taking care neither to rip combined characters nor UTF-16 characters
apart.
- Have a suitable code conversion facet convert the internal
representation into whatever suitable encoding you want to use, e.g.
UTF-16.

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings.

Yep.

I think there are also free alternatives but I
don't know any of them off-hand although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
and converts between this character representation (although, from
a purist view this is not a suitable character representation at all)
and the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).

Luckily, the OP clarified that he has no need for UTF-16. That
simplifies matters a bit.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
davihigh

Thanks, Gentlemen.

I believe I got the idea from all your posts. I also believe that the
i18n support in the STL has a "graceful gap" as well.

Thanks again for your information!

Regards, David Xiao
 
