Detecting Unicode files

Joe Gottman

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn’t handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don’t want to read the Unicode, I just
want to be able to detect it so I can throw an exception.



Joe Gottman
 
Alf P. Steinbach

* Joe Gottman:
I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn’t handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don’t want to read the Unicode, I just
want to be able to detect it so I can throw an exception.

In general that would be the Wrong (TM) approach.

First fix your program so it doesn't crash but produces the usual
garbage result (GIGO) -- program should never crash on unexpected data.

Then, assuming an OS that doesn't associate the required information
with the file, detect Unicode by applying heuristics, starting with
checking for a byte order mark (BOM) at the start of the file, then
perhaps for typical UTF-8 patterns in the file contents.
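
As a rough illustration of the "typical UTF-8 patterns" check, here is a minimal sketch. The function name has_utf8_multibyte is made up for this example, and it assumes you have already read a chunk of the file in binary mode into a std::string. It only checks lead and continuation byte patterns; it does not reject every overlong or surrogate encoding, so treat it as a heuristic, not a validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): returns true if `data` follows
// the UTF-8 lead/continuation byte pattern and contains at least one
// multibyte sequence, i.e. it is not plain ASCII.
bool has_utf8_multibyte(const std::string& data)
{
    bool saw_multibyte = false;
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t trailing = 0;
        if (lead < 0x80) {
            trailing = 0;                       // ASCII byte
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1;                       // 2-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailing = 2;                       // 3-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailing = 3;                       // 4-byte sequence
        } else {
            return false;                       // stray continuation or invalid lead
        }
        if (trailing > data.size() - i - 1) {
            break;                              // sample ends mid-sequence; stop here
        }
        for (std::size_t k = 1; k <= trailing; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;                   // expected a 10xxxxxx continuation byte
            }
        }
        if (trailing > 0) {
            saw_multibyte = true;
        }
        i += trailing + 1;
    }
    return saw_multibyte;
}
```

Text in a single-byte encoding such as Latin-1 usually fails this check quickly, because an accented letter is rarely followed by bytes in the continuation range.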
 
Barry Ding

* Alf P. Steinbach:

In general that would be the Wrong (TM) approach.

First fix your program so it doesn't crash but produces the usual
garbage result (GIGO) -- program should never crash on unexpected data.

Then, assuming an OS that doesn't associate the required information
with the file, detect Unicode by applying heuristics, starting with
checking for a byte order mark (BOM) at the start of the file, then
perhaps for typical UTF-8 patterns in the file contents.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Make a pure text file to support more than one encoding, probably
wrong approach
 
robertwessel2

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn't handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don't want to read the Unicode, I just
want to be able to detect it so I can throw an exception.


While OT...

While it's not certain by any means, if the file starts with (0xff,
0xfe) or (0xfe, 0xff), it's probably a Unicode-encoded file. The
character U+FEFF (the byte order mark) is generally used to introduce
Unicode files, and to allow detection of the byte order (endianness)
of the file.
 
Alf P. Steinbach

* Barry Ding:
Make a pure text file to support more than one encoding, probably
wrong approach

Don't quote signatures, please.

Apart from that, your comment is meaningless to me.

Which probably means that it's inherently meaningless.
 
Barry Ding

Make a pure text file to support more than one encoding, probably
wrong approach

Don't quote signatures, please.

Apart from that, your comment is meaningless to me.

Which probably means that it's inherently meaningless.

Sorry, I have only been using newsgroups for a few days.
I always just click the latest post to answer the question, for
convenience, so the post was meant for you.

Thanks for your guidance, though it hurt a little.
 
James Kanze

I have an application that opens and reads a text file using an
ifstream. Recently one or two users have entered the name of a Unicode
file, which caused my program to crash because it couldn't handle it.
Is there any way to determine after opening a file whether or not it
contains Unicode characters? I don't want to read the Unicode, I just
want to be able to detect it so I can throw an exception.

Several comments:

-- How do you write a program which crashes if a file contains
Unicode? How can it possibly matter?

-- Which encoding format? Just saying Unicode doesn't mean
anything; is it UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or
BE)?

-- Typically, the OS isn't going to tell you, and nothing in
standard C++ will either. You'll just have to read a bit,
and use some heuristics. If every other byte is 0, or
almost, for example, you're probably looking at UTF-16. If
3 out of 4 bytes are 0 (more or less), UTF-32. For UTF-8,
look for the UTF-8 multibyte sequences.

-- And is Unicode really the problem? What happens if the user
specifies an executable? Or any other of a number of binary
file types?
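
The zero-byte heuristic from the third point could be sketched roughly as below. The name guess_by_zero_bytes, the minimum sample size and the ratio thresholds are all assumptions picked for illustration; they only work for text that is mostly ASCII-range characters, and a binary file can easily fool them.

```cpp
#include <cstddef>
#include <string>

enum class Guess { Unknown, Utf16, Utf32 };

// Hypothetical helper: guess the encoding of a byte sample from the share
// of zero bytes. Mostly-ASCII text gives about 1/2 zeros in UTF-16 and
// about 3/4 zeros in UTF-32. The thresholds are arbitrary assumptions.
Guess guess_by_zero_bytes(const std::string& sample)
{
    if (sample.size() < 8) {
        return Guess::Unknown;          // too little data to say anything
    }
    std::size_t zeros = 0;
    for (char c : sample) {
        if (c == '\0') {
            ++zeros;
        }
    }
    const double ratio = static_cast<double>(zeros) / sample.size();
    if (ratio >= 0.65 && ratio <= 0.80) {
        return Guess::Utf32;            // roughly 3 out of 4 bytes are zero
    }
    if (ratio >= 0.40 && ratio <= 0.55) {
        return Guess::Utf16;            // roughly every other byte is zero
    }
    return Guess::Unknown;              // includes plain ASCII and mostly-zero binaries
}
```

The sample has to come from a stream opened in binary mode. This says nothing about byte order; checking whether the zeros sit at even or odd offsets would be the next refinement.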

In the end, I think Alf had the only realistic option: GIGO.
Presumably, your text file has some format, which you parse.
Finding unexpected characters should cause errors in the parse
(and not crash the program), so you output a message saying that
there is a problem at such and such a place in the file. (Try
renaming an executable .cpp, and feeding it to the C++ compiler.
You'll probably get a lot of error messages:), but the compiler
shouldn't crash.)
 
