Determining the encoding of a text file

Discussion in 'Python' started by Rajorshi, Mar 1, 2004.

  1. Rajorshi

    Rajorshi Guest

    Hello!
    How do I determine the encoding of a text file? That is,
    given a text file, I want to know which encoding it is in:
    UTF8 or UTF16 or Latin etc. It would be very helpful if
    you could tell me how to do this in Python on Linux, but
    just the method is acceptable.
    Thanks in advance!
     
    Rajorshi, Mar 1, 2004
    #1

  2. rajorshi> How do I determine the encoding of a text file ? That is,
    rajorshi> given a text file I want to know the encoding it is in UTF8 or
    rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
    rajorshi> me how to do this in python on Linux. But just the method is
    rajorshi> acceptable.

    In general this is not possible. You can guess using heuristics, but there is
    no predefined file attribute that indicates a file's encoding.

    If you have a small set of candidate encodings you can generally do a decent
    job guessing the encoding of a string by considering them in order. I placed
    an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>. I
    don't claim it's perfect and it's really only concerned with distinguishing
    utf-8 and a few encodings which are similar to iso-8859-1, but it does a
    decent job for me given the types of inputs I see.
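
A minimal sketch of the "try candidates in order" idea described above. This is not the code from the Python Bits page; the function name `guess_encoding` and the default candidate list are assumptions made for illustration. It returns the first encoding that decodes the bytes without error. Because iso-8859-1 accepts every byte sequence, it has to come last or it would hide the stricter candidates.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes `data` cleanly.

    `data` is a bytes object. The candidate order matters: strict
    encodings go first, permissive ones (iso-8859-1 never fails) last.
    Returns None if no candidate succeeds.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None
```

A successful decode only shows that the bytes are valid in that encoding, not that the text was actually written in it, so the result is still a guess.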

    Skip
     
    Skip Montanaro, Mar 1, 2004
    #2

  3. Rajorshi

    David Opstad Guest

    In article <>,
    (Rajorshi) wrote:

    > How do I determine the encoding of a text file ? That is,
    > given a text file I want to know the encoding it is in
    > UTF8 or UTF16 or Latin etc. It would be very helpful if
    > you could tell me how to do this in python on Linux. But
    > just the method is acceptable.


    If the first byte in the file is 0xFE and the second is 0xFF, then it's
    likely the file is encoded in big-endian UTF-16. If the first byte is
    0xFF and the second is 0xFE, then it's likely to be little-endian UTF-16.

    Once you've eliminated those possibilities, then it gets trickier...
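
The byte-order-mark check described above could be sketched as follows. The function name `sniff_utf16_bom` is hypothetical; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are the standard-library constants for the two byte pairs mentioned. Many UTF-16 files carry no BOM at all, so a `None` result says nothing either way.

```python
import codecs


def sniff_utf16_bom(data):
    """Guess UTF-16 byte order from a leading byte-order mark.

    `data` is a bytes object holding at least the first two bytes
    of the file. Returns a codec name, or None if no UTF-16 BOM
    is present.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le"
    return None
```

To apply it to a file, open the file in binary mode, read the first two bytes, and pass them to the function.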

    Dave
     
    David Opstad, Mar 1, 2004
    #3
  4. Rajorshi

    J.R. Guest

    "Rajorshi" <> wrote in message
    news:...
    > Hello!
    > How do I determine the encoding of a text file ? That is,
    > given a text file I want to know the encoding it is in
    > UTF8 or UTF16 or Latin etc. It would be very helpful if
    > you could tell me how to do this in python on Linux. But
    > just the method is acceptable.
    > Thanks in advance!


    The Python integrated development environment IDLE, which is distributed
    along with Python, shows one approach to decoding a string. You can find
    it in the file $PYTHON/lib/idlelib/IOBinding.py; look for decode().

    It's not perfect, but you could combine it with Skip's example to write
    your own.
    Additionally, if you want to guess a Chinese encoding, the Perl library
    http://www.mandarintools.com/download/codelib.zip
    may be a useful reference; it supports GB2312-80, Hz, Big5, UTF-8, etc.

    J.R.
     
    J.R., Mar 2, 2004
    #4
  5. Rajorshi

    Rajorshi Guest

    Thanks for your suggestions!


    "J.R." <> wrote in message news:<c20r4m$jn$>...
    > "Rajorshi" <> wrote in message
    > news:...
    > > Hello!
    > > How do I determine the encoding of a text file ? That is,
    > > given a text file I want to know the encoding it is in
    > > UTF8 or UTF16 or Latin etc. It would be very helpful if
    > > you could tell me how to do this in python on Linux. But
    > > just the method is acceptable.
    > > Thanks in advance!

    >
    > The python integrated development environment IDLE, which is distributed
    > alone with python, shows one approach how to decode a
    > string. You could find it in the file $PYTHON/lib/idlelib/IOBinding.py, find
    > the decode().
    >
    > But it's not perfect, you could integrate with Skip's example writing your
    > one.
    > Additional, if you want to guess the Chinese encoding, the perl lib
    > http://www.mandarintools.com/download/codelib.zip
    > may be for your reference, it can support GB2312-80, Hz, Big5, UTF-8, etc.
    >
    > J.R.
     
    Rajorshi, Mar 2, 2004
    #5