How to detect the encoding used for specific text data?

Discussion in 'Python' started by iMath, Dec 20, 2012.

  1. iMath

    iMath Guest

    How to detect the encoding used for specific text data?
    iMath, Dec 20, 2012
    #1

  2. iMath

    iMath Guest

    On Thursday, December 20, 2012 at 7:57:19 PM UTC+8, iMath wrote:

    > How to detect the encoding used for specific text data?


    On Windows XP
    iMath, Dec 20, 2012
    #2

  3. iMath writes:

    > How to detect the encoding used for specific text data?


    The practical thing to do is to try an encoding and see whether you
    find the expected frequent letters of the relevant languages in the
    decoded text, or the most frequent words. This is likely to help you
    decide between some of the most common encodings. Some decoding
    attempts may even raise an exception, which should be a clue.
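
    A minimal sketch of this trial-and-error approach, in Python 3. The
    helper name rank_encodings, the candidate list and the set of
    "expected" letters are illustrative assumptions, not part of any
    library; pick candidates and letters that suit your own texts.

```python
# Hypothetical helper: try several encodings and rank the ones that work.
def rank_encodings(data, candidates=("utf-8", "cp1252", "iso8859-15"),
                   expected="äöÄÖ"):
    """Return (encoding, score) pairs for every candidate that decodes
    without error, best score first. The score is the share of decoded
    characters that belong to the expected letters."""
    results = []
    for name in candidates:
        try:
            text = data.decode(name)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue  # the exception is itself a clue: rule this one out
        hits = sum(text.count(ch) for ch in expected)
        results.append((name, hits / max(len(text), 1)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


sample = "Hyvää päivää".encode("utf-8")
for name, score in rank_encodings(sample):
    print(name, round(score, 3))
```

    A score of zero for an encoding that decodes cleanly is common with
    single-byte encodings, which accept almost any byte sequence; that is
    why the letter check is needed in addition to the exception check.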

    Strictly speaking, it cannot be done with complete certainty. There
    are lots of Finnish texts that are identical whether you think they
    are in Latin-1 or Latin-9. A further text from the same source might
    still reveal the difference, so the distinction matters.
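
    To illustrate the Latin-1 versus Latin-9 case, here is a small sketch
    that compares the two decodings directly. The function name is made
    up for this example; "latin-1" and "iso8859-15" are the codec names
    Python uses for the two encodings.

```python
def same_in_latin1_and_latin9(data):
    """True if the bytes decode to the same text under both encodings."""
    return data.decode("latin-1") == data.decode("iso8859-15")


# Ordinary Finnish text uses no byte on which the two encodings differ.
finnish = "Hyvää päivää, mitä kuuluu?".encode("latin-1")
print(same_in_latin1_and_latin9(finnish))

# A euro sign exists only in Latin-9; Latin-1 reads the same byte as
# a generic currency sign, so this text reveals the difference.
price = "Hinta: 5 €".encode("iso8859-15")
print(same_in_latin1_and_latin9(price))
```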

    Short Finnish texts might also be identical whether you think they are
    in Latin-1 or UTF-8, but the situation is different: a couple of
    frequent letters turn into nonsense in the wrong encoding. It's easy
    to tell at a glance.
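
    The Latin-1 versus UTF-8 case can be tried out like this (Python 3;
    the sample strings are arbitrary):

```python
text = "Hyvää päivää"

# UTF-8 bytes misread as Latin-1: each "ä" turns into two junk characters.
misread = text.encode("utf-8").decode("latin-1")
print(misread)

# Latin-1 bytes misread as UTF-8: usually not even valid, so decoding fails.
try:
    text.encode("latin-1").decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("not valid UTF-8:", exc.reason)

# A short text without any non-ASCII letters is byte-for-byte identical
# in both encodings, so nothing can be told from it.
print("Terve".encode("latin-1") == "Terve".encode("utf-8"))
```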

    Sometimes texts declare their encoding. That should be a clue, but in
    practice the declaration may be false. Sometimes there is a stray
    character that violates the declared or assumed encoding, or a part of
    the text is in one encoding and another part in another. Bad source.
    You decide how important it is to deal with the mess. (This only
    happens in the real world.)
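
    One possible way to check a declared encoding and locate a stray
    byte is sketched below. The function check_declared is hypothetical;
    it relies only on the standard start and end attributes of
    UnicodeDecodeError and on the errors="replace" handler.

```python
def check_declared(data, declared):
    """Decode data with the declared encoding.

    Returns (text, problem). problem is None when the data is valid;
    otherwise it is (offset, offending_bytes) for the first violation,
    and text is decoded with U+FFFD in place of the bad bytes."""
    try:
        return data.decode(declared), None
    except UnicodeDecodeError as exc:
        bad = data[exc.start:exc.end]  # the bytes that broke the rules
        return data.decode(declared, errors="replace"), (exc.start, bad)


# UTF-8 text with one stray Latin-1 byte (0xE4) pasted into it.
messy = "päivä".encode("utf-8") + b" \xe4 "
text, problem = check_declared(messy, "utf-8")
print(repr(text), problem)
```

    This only reports the first violation; whether to repair, replace or
    reject the data is the judgment call described above.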

    Good luck.
    Jussi Piitulainen, Dec 20, 2012
    #3
  4. Stefan H. Holek, Dec 20, 2012
    #4
  5. iMath

    iMath Guest

    Which package to use?
    iMath, Dec 20, 2012
    #5
  7. iMath writes:

    > Which package to use?


    Read the text in as a "bytes object" (bytes); it has a .decode
    method that you can experiment with. Strings (str) are Unicode and
    have an .encode method. Both methods let you specify the desired
    encoding and what to do when there are errors.

    help(bytes.decode)
    help(str.encode)
    help(open)
    <http://docs.python.org/3.3/library/stdtypes.html>
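
    A short Python 3 sketch of these methods and their error handlers.
    The byte string is a made-up sample; in practice the bytes would
    come from something like open(path, "rb").read().

```python
raw = b"Hyv\xe4\xe4 p\xe4iv\xe4\xe4"  # "Hyvää päivää" encoded as Latin-1

# bytes -> str: choose an encoding and an error policy.
text = raw.decode("latin-1")                   # correct guess
dropped = raw.decode("utf-8", errors="ignore")   # silently drops bad bytes
marked = raw.decode("utf-8", errors="replace")   # marks them with U+FFFD
print(text, dropped, marked, sep="\n")

# str -> bytes: the same choice in the other direction.
as_utf8 = text.encode("utf-8")
as_ascii = text.encode("ascii", errors="replace")            # "?" per letter
as_xml = text.encode("ascii", errors="xmlcharrefreplace")    # &#228; etc.
print(as_utf8, as_ascii, as_xml, sep="\n")
```

    The default policy is errors="strict", which raises UnicodeDecodeError
    or UnicodeEncodeError instead of guessing.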

    In Python 2.7 and earlier, str is a byte string that has both the
    .encode and .decode methods (Unicode text is a separate unicode
    type), so the Python version matters here.
    Jussi Piitulainen, Dec 20, 2012
    #7
  8. iMath

    Guest

    On Thursday, December 20, 2012 4:57:19 AM UTC-7, iMath wrote:
    > How to detect the encoding used for specific text data?


    The chardet package will probably do what you want:
    http://pypi.python.org/pypi/chardet
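
    A minimal usage sketch, assuming chardet has been installed
    separately (for example with pip). The wrapper guess_encoding is a
    hypothetical name; chardet.detect is the library's own function and
    returns a dictionary that includes 'encoding' and 'confidence'.

```python
def guess_encoding(data):
    """Return chardet's guess for the given bytes, or None if the
    third-party chardet package is not installed."""
    try:
        import chardet  # third-party, not in the standard library
    except ImportError:
        return None
    return chardet.detect(data)


result = guess_encoding("Hyvää päivää, mitä kuuluu?".encode("utf-8"))
print(result)
```

    The result is a statistical guess, so treat a low confidence value,
    or any guess on a very short input, with suspicion.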
    , Dec 20, 2012
    #8
  9. iMath

    Dave Angel Guest

    On 12/21/2012 07:38 AM, Oscar Benjamin wrote:
    > <snip>
    > On a related note: how to answer question with no context on mailing
    > list?


    Depends on how you're reading/responding. I'll assume you're using an
    email client like Thunderbird, and that you do NOT subscribe in digest form.

    Most general way is to use Reply-All, and remove any recipients you
    don't want there, but make sure you keep the python-list recipient.

    Alternatively, if you're using Thunderbird or another with similar
    capability, use Reply-list, which is smart enough to only keep the list
    entry.

    Or, what I used to do: reply, then add the python-list address to the
    list of recipients. That's error prone.

    I hope this answers your question.


    --

    DaveA
    Dave Angel, Dec 21, 2012
    #9
