How to read UTF-8 text files?

Zephyre

I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~
 
void

You should look for functions that take a parameter letting you choose
the encoding. Maybe the STL includes that kind of function; I'm not
familiar with it, so try it yourself.
 
Tomás

Zephyre posted:
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~

I was writing a program just recently to convert between the different
encoding schemes for Unicode. I used std::bitset to read and write the
values. Look up "ifstream". It's easy to use, as follows:

std::ifstream in("blah.txt");

std::bitset<8> octet;

in >> octet;


and then when you're writing:

std::ofstream out("blah.txt");

std::bitset<32> thirtytwo;

out << thirtytwo;
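
One caveat on the snippet above: operator>> for std::bitset parses the text characters '0' and '1', so it does not consume raw bytes. A minimal sketch for inspecting the octets of a file, assuming that is the goal, reads each byte with get() and builds the bitset from it. The helper name read_octets is hypothetical.

```cpp
#include <bitset>
#include <fstream>
#include <string>
#include <vector>

// Read every byte of a file and wrap each one in a std::bitset<8>.
// Binary mode keeps the stream from translating line endings.
std::vector<std::bitset<8> > read_octets(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    std::vector<std::bitset<8> > octets;
    char c;
    while (in.get(c)) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        octets.push_back(std::bitset<8>(static_cast<unsigned char>(c)));
    }
    return octets;
}
```

Each element can then be examined with to_string() or to_ulong().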


-Tomás
 
loufoque

Zephyre wrote :
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()

fopen() is the C method.
In C++ we have iostreams.

I must read the contents byte by byte

You don't have to.
Just read everything at once.

change the UTF-8
characters to Unicode

Unicode isn't a character encoding by itself, only a character set.
You probably mean UCS-2 or UCS-4. Since you're using Windows terminology
I suppose you mean UCS-2, which is lossy.

Anyway you can simply work with utf-8, no need to convert to something else.


Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~

Simply read them as if they were "ANSI" (the main Windows locale) text
files. The only thing that changes is that a character may span multiple
bytes. If you really care about that being handled correctly, use a set
of functions or classes dedicated to Unicode handling, such as ICU or
Glib::ustring, which acts much like a std::string.
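
To illustrate the "just read the bytes" approach, here is a minimal sketch using only the standard library. The helper names read_file_bytes and count_code_points are hypothetical, and the counter assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file into a std::string without interpreting the bytes.
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Count code points in well-formed UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The string's size() still reports bytes, which is why a separate count is needed whenever the number of characters matters.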
 
Michiel.Salters

Zephyre said:
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Yep, with C++ iostreams the only thing you need is an UTF-8 "codecvt
facet."
You might have one in your std:: library implementation, you could
write one,
or you could buy one (There's one in the Core library from Dinkumware)
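
As one possible instance of such a facet, C++11 added std::codecvt_utf8 in the <codecvt> header. It was deprecated in C++17, so treat the following as a sketch under that assumption and not a recommendation. The function name read_first_line_utf8 is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Read the first line of a UTF-8 file as wide characters.
std::wstring read_first_line_utf8(const char* path) {
    std::wifstream in;
    // Imbue before opening; the locale takes ownership of the facet.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    std::wstring line;
    std::getline(in, line);
    return line;
}
```

Where wchar_t is 16 bits wide, as on Windows, this facet can only represent characters in the Basic Multilingual Plane.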

HTH,
Michiel Salters
 
Tom Widmer

Yep, with C++ iostreams the only thing you need is an UTF-8 "codecvt
facet."
You might have one in your std:: library implementation, you could
write one,
or you could buy one (There's one in the Core library from Dinkumware)

There's an unsupported one hidden away in boost. You just need to do
something like this to get it:

#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>

//...
std::wifstream ifs;
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc);  // imbue before opening the file
ifs.open(...);
//...

Tom
 
Phlip

Tom said:
#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>

//...
std::wifstream ifs;
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc);
ifs.open(...);

For those of us lost in iostream-style locales...

...then what? What code will behave differently because this stream is
imbued? Must I imbue std::strings and std::stringstreams, also, to store
UTF-8 in them?

(And to the original poster: Is this stuff answering your question, or did
you need to do something else with your text besides reading its data?)
 
Tom Widmer

Phlip said:
For those of us lost in iostream-style locales...

...then what? What code will behave differently because this stream is
imbued?

The wchar_t's that are read off the stream will be converted from the
utf8 multibyte characters. In effect, the input file is UTF8, but this
gives you a "view" of the file as UCS-2 (on Windows at least).

E.g.

int i;
ifs >> i; //reads a number

std::wstring ws;
std::getline(ifs, ws);
//ws will correctly contain any international chars

Must I imbue std::strings and std::stringstreams, also, to store
UTF-8 in them?

Well, it only applies to converting between wchars and raw bytes, and
that operation is most commonly performed with file (and network) IO.
So, for standard streams, it only applies to file streams, since other
streams don't perform any code conversion (e.g. wide string streams just
hold the characters in memory as wide characters, whereas wide file
streams have to convert between wide characters and raw bytes, which is
where codecvt comes in).
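
Because strings and string streams perform no conversion, UTF-8 bytes already held in a std::string have to be decoded explicitly if code points are wanted. A minimal hand-written sketch, assuming C++11 for std::u32string, might look like this. The name decode_utf8 is hypothetical, and the checks are deliberately incomplete, as the comments note.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points.
// Sketch only: it rejects truncated sequences and bad lead or
// continuation bytes, but does not check for overlong encodings
// or surrogate code points.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low six bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For production use, a library such as ICU, mentioned earlier in the thread, would handle the remaining validation cases.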

Tom
 
