Detecting Unicode files

Joe Gottman · Jul 12, 2007

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn’t handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don’t want to read the Unicode, I just
want to be able to detect it so I can throw an exception.

Joe Gottman

Alf P. Steinbach · Jul 12, 2007

* Joe Gottman:

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn’t handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don’t want to read the Unicode, I just
want to be able to detect it so I can throw an exception.

In general that would be the Wrong (TM) approach.

First fix your program so it doesn't crash but produces the usual
garbage result (GIGO) -- program should never crash on unexpected data.

Then, assuming an OS that doesn't associate the required information
with the file, detect Unicode by applying heuristics, starting with
checking for a byte order mark (BOM) at the start of the file, then
perhaps for typical UTF-8 patterns in the file contents.
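
A minimal sketch of the "typical UTF-8 patterns" part of that heuristic, assuming the first chunk of the file has already been read into a std::string. The name looks_like_utf8 is made up for illustration. It only checks the lead-byte/continuation-byte structure, so it does not reject overlong forms or surrogates, and a buffer that ends in the middle of a sequence is reported as not UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `data` has well-formed UTF-8 structure AND
// contains at least one multibyte sequence (pure ASCII returns false).
bool looks_like_utf8(const std::string& data)
{
    bool saw_multibyte = false;
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t extra;                       // continuation bytes expected
        if (lead < 0x80)                extra = 0;   // 0xxxxxxx: ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;   // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2;   // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3;   // 11110xxx
        else return false;          // stray continuation or invalid lead byte

        if (extra > data.size() - i - 1) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            // Every continuation byte must look like 10xxxxxx.
            if ((static_cast<unsigned char>(data[i + k]) & 0xC0) != 0x80)
                return false;
        }
        if (extra > 0) saw_multibyte = true;
        i += extra + 1;
    }
    return saw_multibyte;
}
```

Text in a single-byte encoding such as Latin-1 usually fails this test quickly, because an accented letter is rarely followed by bytes in the continuation range.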

Barry Ding · Jul 12, 2007

* Joe Gottman:

In general that would be the Wrong (TM) approach.

First fix your program so it doesn't crash but produces the usual
garbage result (GIGO) -- program should never crash on unexpected data.

Then, assuming an OS that doesn't associate the required information
with the file, detect Unicode by applying heuristics, starting with
checking for a byte order mark (BOM) at the start of the file, then
perhaps for typical UTF-8 patterns in the file contents.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Making a pure text file support more than one encoding is probably the
wrong approach.

robertwessel2 · Jul 12, 2007

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn't handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don't want to read the Unicode, I just
want to be able to detect it so I can throw an exception.

While OT...

While it's not certain by any means, if the file starts with (0xff,
0xfe) or (0xfe, 0xff), it's probably a Unicode coded file. The
character U+FEFF, the byte order mark, is generally used to introduce
Unicode files, and to allow the detection of the byte order
(endianness) of the file: in UTF-16 it appears as (0xff, 0xfe) in
little-endian order and as (0xfe, 0xff) in big-endian order.

Alf P. Steinbach · Jul 12, 2007

* Barry Ding:

Making a pure text file support more than one encoding is probably the
wrong approach.

Don't quote signatures, please.

Apart from that, your comment is meaningless to me.

Which probably means that it's inherently meaningless.

Barry Ding · Jul 12, 2007

Making a pure text file support more than one encoding is probably the

Don't quote signatures, please.

Apart from that, your comment is meaningless to me.

Which probably means that it's inherently meaningless.

Sorry, I've only been using newsgroups for a few days.
I always just click the latest post to answer the question, for
convenience, so the post was meant for you.

Thanks for your guidance, though it hurt a little.

James Kanze · Jul 12, 2007

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn't handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don't want to read the Unicode, I just
want to be able to detect it so I can throw an exception.

Several comments:

-- How do you write a program which crashes if a file contains
Unicode? How can it possibly matter?

-- Which encoding format? Just saying Unicode doesn't mean
anything; is it UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or
BE)?

-- Typically, the OS isn't going to tell you, and nothing in
standard C++ will either. You'll just have to read a bit,
and use some heuristics. If every other byte is 0, or
almost, for example, you're probably looking at UTF-16. If
3 out of 4 bytes are 0 (more or less), UTF-32. For UTF-8,
look for the UTF-8 multibyte sequences.

-- And is Unicode really the problem? What happens if the user
specifies an executable? Or any other of a number of binary
file types?
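
The zero-byte heuristic from the third point could be sketched roughly as follows (C++11). The names WideGuess and guess_wide_encoding and the 0.9 threshold are assumptions chosen for illustration. It only works for text that is mostly in the ASCII/Latin-1 range; UTF-16 text in, say, Chinese has few zero bytes and will come back as Unknown.

```cpp
#include <cstddef>
#include <string>

enum class WideGuess { Unknown, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: guess UTF-16/UTF-32 from where the zero bytes fall.
// `threshold` is the fraction of bytes at a position (index mod 4) that
// must be zero for that position to count as "mostly zero".
WideGuess guess_wide_encoding(const std::string& data, double threshold = 0.9)
{
    if (data.size() < 8) return WideGuess::Unknown;   // too little evidence

    std::size_t zeros[4] = {0, 0, 0, 0};
    std::size_t total[4] = {0, 0, 0, 0};
    for (std::size_t i = 0; i < data.size(); ++i) {
        ++total[i % 4];
        if (data[i] == '\0') ++zeros[i % 4];
    }
    auto mostly_zero = [&](int k) {
        return static_cast<double>(zeros[k]) >= threshold * total[k];
    };
    auto rarely_zero = [&](int k) {
        return static_cast<double>(zeros[k]) <= (1.0 - threshold) * total[k];
    };

    // 3 out of 4 bytes zero: UTF-32. The non-zero byte gives the byte order.
    if (rarely_zero(0) && mostly_zero(1) && mostly_zero(2) && mostly_zero(3))
        return WideGuess::Utf32LE;
    if (mostly_zero(0) && mostly_zero(1) && mostly_zero(2) && rarely_zero(3))
        return WideGuess::Utf32BE;
    // Every other byte zero: UTF-16.
    if (rarely_zero(0) && mostly_zero(1) && rarely_zero(2) && mostly_zero(3))
        return WideGuess::Utf16LE;
    if (mostly_zero(0) && rarely_zero(1) && mostly_zero(2) && rarely_zero(3))
        return WideGuess::Utf16BE;
    return WideGuess::Unknown;
}
```

An executable or other binary file will normally come back as Unknown too, which is one more reason not to rely on this alone.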

In the end, I think Alf had the only realistic option: GIGO.
Presumably, your text file has some format, which you parse.
Finding unexpected characters should cause errors in the parse
(and not crash the program), so you output a message saying that
there is a problem at such and such a place in the file. (Try
renaming an executable .cpp, and feeding it to the C++ compiler.
You'll probably get a lot of error messages, but the compiler
shouldn't crash.)
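
As a sketch of what "a message saying that there is a problem at such and such a place" might look like: the function below assumes, purely for illustration, that the expected format is printable ASCII plus tabs. The name check_plain_text and the message layout are made up. A real parser would report its own syntax errors in the same way instead of running a separate pass.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical validation pass: throws std::runtime_error naming the file,
// line and column of the first byte that is not printable ASCII or a tab.
// A '\r' is tolerated only as the last character of a line (CRLF files).
void check_plain_text(std::istream& in, const std::string& name)
{
    std::string line;
    for (std::size_t line_no = 1; std::getline(in, line); ++line_no) {
        for (std::size_t col = 0; col < line.size(); ++col) {
            const unsigned char c = static_cast<unsigned char>(line[col]);
            const bool ok = (c >= 0x20 && c < 0x7F) || c == '\t'
                         || (c == '\r' && col + 1 == line.size());
            if (!ok) {
                std::ostringstream msg;
                msg << name << ':' << line_no << ':' << (col + 1)
                    << ": unexpected byte 0x" << std::hex << std::uppercase
                    << static_cast<unsigned>(c);
                throw std::runtime_error(msg.str());
            }
        }
    }
}
```

Fed a UTF-16 file, an executable, or anything else that is not the expected text, this reports the first offending byte and stops, which also covers the original wish to throw an exception.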

BobR · Jul 12, 2007

Barry Ding said:
Sorry, I've only been using newsgroups for a few days.

I would never consider posting to a group until I had read there for a few
*weeks*, and/or read their FAQ.

FAQ http://www.parashift.com/c++-faq-lite

