How to test the encoding of a file?


YGUEL

Hello,
Do you know of a good program for detecting which character encoding
a file uses?
I use iconv, but it can only convert from one character encoding to
another. The problem is that I have some files, and the way I receive
them does not guarantee that the encoding they claim to use is the one
they actually use.

Thanks for discussing this subject with me.

P.S. I don't think that trying every possible encoding with iconv is
a good solution.
 

Manuel Yguel

YGUEL said:
Hello,
Do you know of a good program for detecting which character encoding
a file uses?
I use iconv, but it can only convert from one character encoding to
another. The problem is that I have some files, and the way I receive
them does not guarantee that the encoding they claim to use is the one
they actually use.

Thanks for discussing this subject with me.

P.S. I don't think that trying every possible encoding with iconv is
a good solution.

I have seen Appendix F of XML 1.0, but does code exist that does
that?
 

Toni Uusitalo

YGUEL said:
Hello,
Do you know of a good program for detecting which character encoding
a file uses?

Conformant XML parsers do this up to a certain point (the ones that implement
Appendix F of the XML 1.0 spec, as you mentioned).
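As a rough illustration of the kind of check Appendix F describes, here is a minimal Python sketch. The function name `sniff_xml_encoding` and the tables are my own, not taken from any parser, and the sketch leaves out the EBCDIC case and other details that Appendix F covers. It only reports what the file claims about itself; it does not verify the claim.

```python
import re

# Byte-order marks, longest first so UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# Without a BOM, the first bytes of "<?" reveal the code unit width and order.
PREFIXES = [
    (b"\x00\x00\x00<", "UTF-32BE"),
    (b"<\x00\x00\x00", "UTF-32LE"),
    (b"\x00<\x00?", "UTF-16BE"),
    (b"<\x00?\x00", "UTF-16LE"),
]

# Encoding pseudo-attribute of an XML declaration in an ASCII-compatible encoding.
DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding an XML document claims, from its first bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in PREFIXES:
        if data.startswith(prefix):
            return name
    match = DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No BOM and no encoding declaration: XML defaults to UTF-8.
    return "UTF-8"
```

For example, `sniff_xml_encoding(open("some.xml", "rb").read(200))` would return the declared name for a typical file, where `some.xml` is a placeholder path.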

I use iconv, but it can only convert from one character encoding to
another. The problem is that I have some files, and the way I receive
them does not guarantee that the encoding they claim to use is the one
they actually use.

The problem here is that there is no foolproof way to do this.
Take this document, for example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>*</doc>

Here * stands for the copyright sign (byte 0xA9 in ISO-8859-1).
Suppose that, despite the ISO-8859-1 declaration, the document was
actually saved as UTF-8, so * was written as the two bytes
0xC2 0xA9. If you load that file with an XML parser and write the
result out as UTF-8, you get 0xC3 0x82 0xC2 0xA9 (the first two bytes
are 0xC2 converted to UTF-8, and the last two bytes are 0xA9 converted
to UTF-8). Since 0xC2 and 0xA9 are both perfectly legal Latin-1
characters, how would you detect that the file was saved in the wrong
encoding?
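The byte values in this example can be reproduced with a few lines of Python. This is only a sketch of the round trip described above; the helper name `misread_as_latin1` is made up for the illustration.

```python
def misread_as_latin1(text: str) -> bytes:
    """Save text as UTF-8, read it back as ISO-8859-1, re-encode as UTF-8."""
    saved = text.encode("utf-8")            # what is really on disk
    misread = saved.decode("iso-8859-1")    # what a parser trusting the declaration sees
    return misread.encode("utf-8")          # what it writes back out


copyright_sign = "\u00a9"
print(copyright_sign.encode("utf-8").hex())      # c2a9
print(misread_as_latin1(copyright_sign).hex())   # c382c2a9
```

Note that the misreading step never raises an error, because every byte value maps to some character in ISO-8859-1. That is why the mistake goes unnoticed.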

Thanks for discussing this subject with me.

P.S. I don't think that trying every possible encoding with iconv is
a good solution.

If you're dealing with XML, an XML declaration with encoding="whatever"
specified would only be recognized by an XML parser, not by iconv.
There might be some solutions available that I'm not aware of, though; try Google.
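One heuristic that may help with the specific case above, offered as an assumption rather than a known tool: text that contains non-ASCII bytes and still decodes as strict UTF-8 is very unlikely to be genuine single-byte text. A minimal Python sketch follows; both function names and the list of encoding name prefixes are hypothetical. It can be fooled, for instance by a real Latin-1 file that happens to contain the two characters "Â©".

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data has non-ASCII bytes and all of it is valid UTF-8."""
    if not any(byte >= 0x80 for byte in data):
        return False  # pure ASCII tells us nothing either way
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def declared_encoding_is_suspect(data: bytes, declared: str) -> bool:
    """Flag a single-byte declaration on data that looks like UTF-8."""
    # Assumed, incomplete list of single-byte encoding name prefixes.
    single_byte = declared.lower().startswith(("iso-8859", "windows-125", "latin"))
    return single_byte and looks_like_utf8(data)
```

With the document from the example, `declared_encoding_is_suspect(b"<doc>\xc2\xa9</doc>", "ISO-8859-1")` flags the file, while a file that really stores the copyright sign as the single byte 0xA9 is not flagged, because a lone 0xA9 is not valid UTF-8.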

with respect,
Toni Uusitalo
 
