How to read UTF-8 text files?

Zephyre · Apr 25, 2006

I have some UTF-8 text files written in Chinese that I need to read. The
only method I know for reading text from them is the fopen()
function, so I must read the contents byte by byte, convert the UTF-8
characters to Unicode, and store the characters in wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Is there a way to read UTF-8 text files that is as simple and
convenient as the way we read ANSI text files? Thanks a lot~~

void · Apr 25, 2006

You should look for functions that take a parameter letting you
choose the encoding.
Maybe the STL includes that kind of function; I'm not familiar with it. Try
it yourself.

Tomás · Apr 25, 2006

Zephyre posted:

> I have some UTF-8 text files written in Chinese that I need to read. The
> only method I know for reading text from them is the fopen()
> function, so I must read the contents byte by byte, convert the UTF-8
> characters to Unicode, and store the characters in wchar_t variables. But
> I think this method is too complex and isn't elegant at all.
>
> Is there a way to read UTF-8 text files that is as simple and
> convenient as the way we read ANSI text files? Thanks a lot~~

I was writing a program just recently to convert between the different
encoding schemes for Unicode. I used std::bitset to read and write the
values. Look up "ifstream". It's easy to use, as follows:

std::ifstream in("blah.txt");

std::bitset<8> octet;

in >> octet;

and then when you're writing:

std::ofstream out("blah.txt");

std::bitset<32> thirtytwo;

out << thirtytwo;

-Tomás
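
One caveat on the snippet above: operator>> on a std::bitset parses the text characters '0' and '1' from the stream, so it does not pick up a raw byte from a UTF-8 file. A minimal sketch of reading raw octets into bitsets, assuming the file is opened in binary mode (read_octets is a hypothetical helper name, not something from the post):

```cpp
#include <bitset>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read every byte of a file as an 8-bit bitset.
std::vector<std::bitset<8>> read_octets(const std::string& path) {
    std::ifstream in(path, std::ios::binary);  // binary: no newline translation
    std::vector<std::bitset<8>> octets;
    char c;
    while (in.get(c)) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        octets.emplace_back(static_cast<unsigned char>(c));
    }
    return octets;
}
```

Each element then holds the bit pattern of one byte, which makes the UTF-8 lead and continuation markers easy to inspect.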

Raider · Apr 25, 2006

Try using the ICU library.
http://www-306.ibm.com/software/globalization/icu/index.jsp

loufoque · Apr 25, 2006

Zephyre wrote:

> I have some UTF-8 text files written in Chinese that I need to read. The
> only method I know for reading text from them is the fopen()

fopen() is the C method.
In C++ we have iostreams.

> I must read the contents byte by byte

You don't have to.
Just read everything at once.

> convert the UTF-8
> characters to Unicode

Unicode isn't a character encoding by itself, only a character set.
You probably mean UCS-2 or UCS-4. Since you're using Windows terminology,
I suppose you mean UCS-2, which is lossy: it cannot represent characters
outside the Basic Multilingual Plane.
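
To illustrate why a single 16-bit unit is not enough, here is a rough sketch of how UTF-16 encodes one code point. Anything above U+FFFF needs two units (a surrogate pair), which plain UCS-2 has no way to express. The function name to_utf16 is made up for this example, and it assumes the input is a valid code point:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    if (cp < 0x10000) {
        // Fits in one 16-bit unit; this is the only case UCS-2 covers.
        return {static_cast<std::uint16_t>(cp)};
    }
    cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
    return {static_cast<std::uint16_t>(0xD800 + (cp >> 10)),     // high surrogate
            static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))};  // low surrogate
}
```

Some less common Chinese characters live above U+FFFF (for example U+20000), so this matters for the original question.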

Anyway, you can simply work with UTF-8; there's no need to convert to something else.

> Is there a way to read UTF-8 text files that is as simple and
> convenient as the way we read ANSI text files? Thanks a lot~~

Simply read them as if they were "ANSI" (Windows main locale) text files.
The only thing that changes is that a character may span multiple bytes.
If you really care about that being handled correctly, use a set of
functions or classes dedicated to Unicode handling, like ICU or
Glib::ustring, which acts just like a std::string.
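
A minimal sketch of that approach, assuming the file really is well-formed UTF-8: read all the bytes into a std::string at once, then count characters by skipping continuation bytes (those of the form 10xxxxxx). The names read_all and count_code_points are hypothetical helpers for illustration only:

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file into a std::string, bytes untouched.
std::string read_all(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: count code points in a string assumed to be valid
// UTF-8. No validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;  // skip continuation bytes (10xxxxxx)
    }
    return n;
}
```

Note that size() on the string still reports bytes, not characters, which is the one difference from an "ANSI" file that the post describes.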

Michiel.Salters · Apr 25, 2006

Zephyre said:

> I have some UTF-8 text files written in Chinese that I need to read. The
> only method I know for reading text from them is the fopen()
> function, so I must read the contents byte by byte, convert the UTF-8
> characters to Unicode, and store the characters in wchar_t variables. But
> I think this method is too complex and isn't elegant at all.

Yep, with C++ iostreams the only thing you need is a UTF-8 "codecvt
facet". You might have one in your std:: library implementation, you could
write one, or you could buy one (there's one in the Core library from
Dinkumware).

HTH,
Michiel Salters

Tom Widmer · Apr 25, 2006

Michiel.Salters wrote:

> Yep, with C++ iostreams the only thing you need is an UTF-8 "codecvt
> facet."
> You might have one in your std:: library implementation, you could
> write one,
> or you could buy one (There's one in the Core library from Dinkumware)

There's an unsupported one hidden away in Boost. You just need to do
something like this to get it:

#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>

//...
std::wifstream ifs;
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc);
ifs.open(...);
//...

Tom

Phlip · Apr 25, 2006

Tom said:

> #define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
> #define BOOST_UTF8_END_NAMESPACE }
> #define BOOST_UTF8_DECL
> #include <boost/detail/utf8_codecvt_facet.hpp>
>
> //...
> std::wifstream ifs;
> std::locale utf8loc(std::locale(),
>                     new mynamespace::utf8_codecvt_facet());
> ifs.imbue(utf8loc);
> ifs.open(...);

For those of us lost in iostream-style locales...

...then what? What code will behave differently because this stream is
imbued? Must I also imbue std::strings and std::stringstreams to store
UTF-8 in them?

(And to the original poster: Is this stuff answering your question, or did
you need to do something else with your text besides reading its data?)

Tom Widmer · Apr 25, 2006

Phlip said:

> For those of us lost in iostream-style locales...
>
> ...then what? What code will behave differently because this stream is
> imbued?

The wchar_t values read off the stream will be converted from the
UTF-8 multibyte characters. In effect, the input file is UTF-8, but this
gives you a "view" of the file as UCS-2 (on Windows at least).

E.g.

int i;
ifs >> i; //reads a number

std::wstring ws;
std::getline(ifs, ws);
//ws will correctly contain any international chars

> Must I imbue std::strings and std::stringstreams, also, to store
> UTF-8 in them?

Well, it only applies to converting between wchars and raw bytes, and
that operation is most commonly performed with file (and network) IO.
So, for standard streams, it only applies to file streams, since other
streams don't perform any code conversion (e.g. wide string streams just
hold the characters in memory as wide characters, whereas wide file
streams have to convert between wide characters and raw bytes, which is
where codecvt comes in).

Tom
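
For anyone curious what that byte-to-wide-character conversion amounts to, here is a hand-written sketch of UTF-8 decoding. It is not the Boost facet's code, just an illustration under some assumptions: it produces UTF-32 (char32_t, so it needs C++11) rather than wchar_t, it replaces malformed or truncated sequences with U+FFFD, and it does not reject overlong forms or encoded surrogates. The name decode_utf8 is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences become U+FFFD. Overlong encodings and
// encoded surrogates are NOT rejected (assumption: mostly trusted input).
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }   // truncated input
            unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

A codecvt facet does this kind of work incrementally inside the file stream's buffer, which is why only file streams need it and string streams do not.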
