Determining the encoding of a text file

Rajorshi

Hello!
How do I determine the encoding of a text file? That is,
given a text file, I want to know which encoding it is in:
UTF-8, UTF-16, Latin, etc. It would be very helpful if
you could tell me how to do this in Python on Linux, but
just the method is acceptable.
Thanks in advance!
 
Skip Montanaro

rajorshi> How do I determine the encoding of a text file ? That is,
rajorshi> given a text file I want to know the encoding it is in UTF8 or
rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
rajorshi> me how to do this in python on Linux. But just the method is
rajorshi> acceptable.

In general this is not possible. You can guess using heuristics, but there is
no predefined file attribute that indicates a file's encoding.

If you have a small set of candidate encodings you can generally do a decent
job guessing the encoding of a string by considering them in order. I placed
an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>. I
don't claim it's perfect and it's really only concerned with distinguishing
utf-8 and a few encodings which are similar to iso-8859-1, but it does a
decent job for me given the types of inputs I see.
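
As a rough illustration of "considering them in order" (this is not the code from the Python Bits page, just a minimal sketch), the idea can be written as a loop that tries each candidate and returns the first one that decodes cleanly. The function name guess_encoding and the default candidate list are assumptions made for this example, and it assumes Python 3, where file contents read in binary mode are bytes.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes `data` without error.

    `data` is a bytes object. The candidate list is an assumption for this
    sketch; order matters, because iso-8859-1 accepts every byte sequence
    and so must come last or it will hide everything after it.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None  # none of the candidates fit


if __name__ == "__main__":
    import sys

    # Usage: python guess.py somefile.txt
    with open(sys.argv[1], "rb") as f:
        print(guess_encoding(f.read()))
```

A successful decode only shows that the bytes are valid in that encoding, not that the text was actually written in it, so the result is still a guess.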

Skip
 
David Opstad

Rajorshi said:
How do I determine the encoding of a text file ? That is,
given a text file I want to know the encoding it is in
UTF8 or UTF16 or Latin etc. It would be very helpful if
you could tell me how to do this in python on Linux. But
just the method is acceptable.

If the first byte in the file is 0xFE and the second is 0xFF, then it's
likely the file is encoded in big-endian UTF-16. If the first byte is
0xFF and the second is 0xFE, then it's likely to be little-endian UTF-16.

Once you've eliminated those possibilities, then it gets trickier...
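
A minimal sketch of that byte-order-mark check in Python 3 might look like the following. The function name sniff_bom is made up for this example, and the UTF-8 signature (EF BB BF) is an addition beyond the two UTF-16 marks described above. The BOM constants come from the standard codecs module.

```python
import codecs

# (BOM bytes, label) pairs. The UTF-8 entry is an extra not mentioned above.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data):
    """Return an encoding label if `data` (bytes) starts with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM found; the encoding is still unknown


if __name__ == "__main__":
    import sys

    # Only the first few bytes are needed for this check.
    with open(sys.argv[1], "rb") as f:
        print(sniff_bom(f.read(4)))
```

The result is only a hint: a UTF-32 little-endian file also begins with FF FE, and many UTF-16 and UTF-8 files carry no BOM at all.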

Dave
 
J.R.

Rajorshi said:
Hello!
How do I determine the encoding of a text file ? That is,
given a text file I want to know the encoding it is in
UTF8 or UTF16 or Latin etc. It would be very helpful if
you could tell me how to do this in python on Linux. But
just the method is acceptable.
Thanks in advance!

The Python integrated development environment IDLE, which is distributed
along with Python, shows one approach to decoding a string. You can find it
in the file $PYTHON/lib/idlelib/IOBinding.py; look for decode().

It's not perfect, but you could combine it with Skip's example to write your
own.
Additionally, if you want to guess Chinese encodings, the Perl library at
http://www.mandarintools.com/download/codelib.zip
may be a useful reference; it supports GB2312-80, Hz, Big5, UTF-8, etc.

J.R.
 
Rajorshi

Thanks for your suggestions!


 
