Distinguishing cp850 and cp1252?

Discussion in 'Python' started by David Eppstein, Nov 3, 2003.

  1. I'm working on some Python code for reading files in a certain format,
    and the examples of such files I've found on the internet appear to be
    in either cp850 or cp1252 encoding (apart from one exception for which
    I can't find a correct encoding among the standard Python ones).

    The file format itself includes nothing about which encoding is used,
    but only one of the two produces sensible results in the non-ascii
    examples I've seen.

    Is there an easy way of guessing with reasonable accuracy which of these
    two encodings was used for a particular file?

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Nov 3, 2003
    #1

  2. John Roth

    "David Eppstein" <> wrote in message
    news:...
    > I'm working on some Python code for reading files in a certain format,
    > and the examples of such files I've found on the internet appear to be
    > in either cp850 or cp1252 encoding (apart from one exception for which
    > I can't find a correct encoding among the standard Python ones).
    >
    > The file format itself includes nothing about which encoding is used,
    > but only one of the two produces sensible results in the non-ascii
    > examples I've seen.
    >
    > Is there an easy way of guessing with reasonable accuracy which of these
    > two encodings was used for a particular file?


    The only way I know of is to do a statistical analysis of letter
    frequencies. To do that, you have to know your data fairly well.
    For example, CP850 devotes a number of code points to box-drawing
    characters. If your data doesn't involve drawing boxes and you find
    those characters in the input, that's a strong clue that you're
    dealing with CP1252.

    I know this doesn't help all that much, but it's the only thing
    that has worked for me.
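
    In modern Python, a minimal sketch of that box-drawing clue might look
    like this (the U+2500-U+259F range check and the names here are
    illustrative, not from the post):

    # A high byte is "suspicious" if cp850 decodes it to a box-drawing
    # or block-element character (U+2500-U+259F). cp850 defines all
    # 256 bytes, so the decode below never fails.
    BOX_BYTES = {
        b for b in range(0x80, 0x100)
        if 0x2500 <= ord(bytes([b]).decode("cp850")) <= 0x259F
    }

    def box_art_score(data: bytes) -> int:
        """Count bytes that would render as line art under cp850.
        A high score in ordinary prose points toward cp1252."""
        return sum(b in BOX_BYTES for b in data)

    # 'É' is byte 0xC9 in cp1252, but 0xC9 is the box corner '╔' in cp850.
    print(box_art_score("Éléphant".encode("cp1252")))  # -> 1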

    John Roth
    John Roth, Nov 3, 2003
    #2

  3. David Eppstein wrote:

    > Is there an easy way of guessing with reasonable accuracy which of these
    > two encodings was used for a particular file?


    You could try the assumption that most characters should be letters,
    assuming your documents are likely text documents of some sort. The idea
    is that what is a letter in one code is some non-letter graphical symbol
    in the other.

    So you would create a predicate "isletter" for each character set, and
    then count the number of bytes in a document which are not letters. You
    should probably exclude the ASCII characters in counting, since they
    would have the same interpretation in either code page. The encoding
    that gives you fewer (or no) non-letter characters is likely the
    correct interpretation.

    To find out which bytes are letters, you could use unicodedata.category;
    the letter categories start with "L" ("Lu" for uppercase, "Ll" for
    lowercase, and so on). You should compute a bitmap for each character
    set up front, and find out how much the two bitmaps overlap.
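
    A sketch of that scheme in modern Python, using sets in place of
    literal bitmaps (the function names are illustrative):

    import unicodedata

    def letter_bytes(encoding):
        """High bytes (>= 0x80) that decode to characters whose Unicode
        category starts with "L" under the given encoding."""
        result = set()
        for b in range(0x80, 0x100):
            # cp1252 leaves a few bytes undefined; ignore those.
            ch = bytes([b]).decode(encoding, errors="ignore")
            if ch and unicodedata.category(ch).startswith("L"):
                result.add(b)
        return result

    CP850_LETTERS = letter_bytes("cp850")
    CP1252_LETTERS = letter_bytes("cp1252")

    # The overlap shows how ambiguous the two interpretations are.
    print(len(CP850_LETTERS & CP1252_LETTERS))

    def nonletter_count(data, letters):
        """Non-ASCII bytes that are not letters under this interpretation;
        ASCII is excluded since both code pages agree there."""
        return sum(1 for b in data if b >= 0x80 and b not in letters)

    def guess(data):
        """Prefer the interpretation that yields fewer non-letter bytes."""
        if nonletter_count(data, CP850_LETTERS) < nonletter_count(data, CP1252_LETTERS):
            return "cp850"
        return "cp1252"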

    To get higher accuracy, you need advance knowledge of the natural
    language your documents are in, and then you need a dictionary of
    that language.
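
    Such a refinement might look like this (the tiny word list is purely
    illustrative; in practice you would load a real dictionary for the
    documents' language):

    # Illustrative sample only; use a real word list in practice.
    FRENCH_SAMPLE = {"être", "déjà", "français", "où", "château"}

    def dictionary_score(data, encoding, words):
        """Count tokens of the decoded text that are known words."""
        text = data.decode(encoding, errors="replace").lower()
        return sum(1 for token in text.split() if token.strip(".,;:!?") in words)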

    HTH,
    Martin
    Martin v. Löwis, Nov 3, 2003
    #3
  4. In article <>,
    "John Roth" <> wrote:

    > > Is there an easy way of guessing with reasonable accuracy which of these
    > > two encodings was used for a particular file?

    >
    > The only way I know of is to do a statistical analysis of letter
    > frequencies. To do that, you have to know your data fairly well.
    > For example, CP850 devotes a number of code points to box-drawing
    > characters. If your data doesn't involve drawing boxes and you find
    > those characters in the input, that's a strong clue that you're
    > dealing with CP1252.


    Thanks. After trying some other, more hackish things that sort of
    worked (e.g., does the encoding produce characters with ord() > 255?),
    I settled on a very simple statistical scheme: count how many times
    each encoding produces characters that answer true to isalpha(), and
    let the higher count win. Seems to be working...
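
    That voting scheme fits in a few lines; a sketch in modern Python
    (the candidate tuple and the filename are illustrative):

    def guess_encoding(data, candidates=("cp850", "cp1252")):
        """Score each encoding by how many decoded characters are
        alphabetic; the highest score wins."""
        def votes(enc):
            try:
                return sum(ch.isalpha() for ch in data.decode(enc))
            except UnicodeDecodeError:
                return -1  # data contains bytes undefined in this encoding
        return max(candidates, key=votes)

    with open("example.dat", "rb") as f:  # illustrative filename
        print(guess_encoding(f.read()))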

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Nov 3, 2003
    #4
