File Read Spanish characters

Chip

There is surprisingly little information on the various encoding options for
reading a text file. I have what seems to be a very basic issue: I'm reading
a text file that includes Spanish characters such as "ñ". When I read the
file into a string, that character is missing. Encoding seems to be the
culprit. File writers SHOULD begin a file with the BOM (Byte Order Mark) to
let us know what encoding to read the file with, but most software doesn't
do this, so we are left with BOMless files. So how can we reliably read these
files without knowing what encoding they were written with?

Through trial and error I have found that using UTF-7 picks up these Spanish
characters, along with the English:

Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

Since I am clueless on matters of encoding, my question is: am I safe using
UTF-7 if I only care about English and Spanish? What is the downside? I
won't be able to read Romanian? Japanese?

Is there a way to programmatically find the correct encoding without the BOM?

Chip
 
Joerg Jooss

Chip said:
> There is surprisingly little information on the various encoding
> options for reading a text file. I have what seems to be a very basic
> issue: I'm reading a text file that includes Spanish characters such
> as "ñ". When I read the file into a string, that character is
> missing. Encoding seems to be the culprit. File writers SHOULD begin
> a file with the BOM (Byte Order Mark) to let us know what encoding to
> read the file with, but most software doesn't do this so we are left
> with BOMless files.

Remember that these are byte order marks, which are intended to be used
for identifying whether an encoding uses Big Endian or Little Endian
representation. The fact that some encodings can be identified by their
BOM is just a nice side effect.

> So how can we reliably read these files without
> knowing what encoding they were written with?

Only through application-specific metadata (like HTTP headers).
There's no grand universal scheme to tell a file's character encoding.

> Through trial and error I have found that using UTF-7 picks up these
> Spanish characters, along with the English. Dim Reader As New
> StreamReader(fs, System.Text.Encoding.UTF7).

That's quite likely not what you want. Try Encoding.Default.

> Since I am clueless on matters of encoding, my question is: am I safe
> using UTF-7 if I only care about English and Spanish? What is the
> downside? I won't be able to read Romanian? Japanese?

Depends on the input. UTF-7 is only (and rarely?) used for e-mail. I
guess the chance of finding a true UTF-7 encoded file is pretty much zero.

> Is there a way to programmatically find the correct encoding without
> the BOM?

As I said, in general no. If the range of possible encodings is
limited, you may be able to create a proper detection algorithm, though.
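
As a rough illustration of such a limited-range detection, here is a minimal VB.NET sketch. It assumes the only candidates are UTF-8 and ISO-8859-1, which is an assumption made for this example and not something stated in the thread. The helper name DecodeUtf8OrLatin1 and the file name input.txt are hypothetical. The idea is to try a strict UTF-8 decode first and fall back to ISO-8859-1 only if the bytes are not well-formed UTF-8.

```vbnet
Imports System.IO
Imports System.Text

Module EncodingGuess
    ' Hypothetical helper: assumes the bytes are either UTF-8 or ISO-8859-1.
    Function DecodeUtf8OrLatin1(data As Byte()) As String
        ' Skip a UTF-8 BOM (EF BB BF) if present; GetString would
        ' otherwise keep it as a leading U+FEFF character.
        Dim offset As Integer = 0
        If data.Length >= 3 AndAlso data(0) = &HEF AndAlso
           data(1) = &HBB AndAlso data(2) = &HBF Then
            offset = 3
        End If

        ' Second argument (throwOnInvalidBytes) = True: malformed input
        ' raises an exception instead of being replaced silently.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Return strictUtf8.GetString(data, offset, data.Length - offset)
        Catch ex As DecoderFallbackException
            ' Not well-formed UTF-8: fall back to the assumed alternative.
            Return Encoding.GetEncoding("iso-8859-1").GetString(data)
        End Try
    End Function

    Sub Main()
        ' Hypothetical file name, for illustration only.
        Dim data As Byte() = File.ReadAllBytes("input.txt")
        Console.WriteLine(DecodeUtf8OrLatin1(data))
    End Sub
End Module
```

This works for the two-candidate case because an ISO-8859-1 "ñ" is the single byte 0xF1, which is not valid UTF-8 when followed by an ordinary ASCII letter. It cannot tell ISO-8859-1 apart from other single-byte code pages, so it is only as good as the assumption about which encodings can occur.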

Cheers,
 
Juan T. Llibre

If you only care about English and Spanish,
you'll be safe using iso-8859-1.
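
For reference, a minimal VB.NET sketch of reading a file with an explicitly chosen encoding might look like the following. It assumes the file really was written as ISO-8859-1; the file name spanish.txt is hypothetical.

```vbnet
Imports System.IO
Imports System.Text

Module ReadLatin1Example
    Sub Main()
        ' Hypothetical file name, for illustration only.
        Dim fileName As String = "spanish.txt"

        ' ISO-8859-1 maps every byte 0x00-0xFF to the code point with the
        ' same value, so the single byte 0xF1 decodes to "ñ".
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")

        Using reader As New StreamReader(fileName, latin1)
            Dim content As String = reader.ReadToEnd()
            Console.WriteLine(content)
        End Using
    End Sub
End Module
```

If the file is actually UTF-8 without a BOM, this will not fail but will show each accented character as two unrelated characters, which is a quick way to notice that the guess was wrong.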



Juan T. Llibre
ASP.NET MVP
============
 
