Distinguishing cp850 and cp1252?

Discussion in 'Python' started by David Eppstein, Nov 3, 2003.

  1. I'm working on some Python code for reading files in a certain format,
    and the examples of such files I've found on the internet appear to be
    in either cp850 or cp1252 encoding (apart from one exception for which
    I can't find a correct encoding among the standard Python ones).

    The file format itself includes nothing about which encoding is used,
    but only one of the two produces sensible results in the non-ascii
    examples I've seen.

    Is there an easy way of guessing with reasonable accuracy which of these
    two encodings was used for a particular file?

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Nov 3, 2003
    #1

  2. John Roth

    "David Eppstein" <> wrote in message
    news:...
    > I'm working on some Python code for reading files in a certain format,
    > and the examples of such files I've found on the internet appear to be
    > in either cp850 or cp1252 encoding (apart from one exception for which
    > I can't find a correct encoding among the standard Python ones).
    >
    > The file format itself includes nothing about which encoding is used,
    > but only one of the two produces sensible results in the non-ascii
    > examples I've seen.
    >
    > Is there an easy way of guessing with reasonable accuracy which of these
    > two encodings was used for a particular file?


    The only way I know of is to do a statistical analysis of letter
    frequencies. To do that, you have to know your data fairly well.
    For example, CP850 devotes a number of code points to box-drawing
    characters. If your data doesn't involve drawing boxes and you find
    those characters in the input, that's a strong clue that you're
    dealing with CP1252.

    I know this doesn't help all that much, but it's the only thing
    that has worked for me.
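
    In modern Python, a minimal sketch of that box-drawing clue might look
    like this (the U+2500-U+259F range check and the names here are
    illustrative, not from the post):

    # A high byte is "suspicious" if cp850 decodes it to a box-drawing
    # or block-element character (U+2500-U+259F). cp850 defines all
    # 256 bytes, so the decode below never fails.
    BOX_BYTES = {
        b for b in range(0x80, 0x100)
        if 0x2500 <= ord(bytes([b]).decode("cp850")) <= 0x259F
    }

    def box_art_score(data: bytes) -> int:
        """Count bytes that would render as line art under cp850.
        A high score in ordinary prose points toward cp1252."""
        return sum(b in BOX_BYTES for b in data)

    # 'É' is byte 0xC9 in cp1252, but 0xC9 is the box corner '╔' in cp850.
    print(box_art_score("Éléphant".encode("cp1252")))  # -> 1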

    John Roth
    John Roth, Nov 3, 2003
    #2

  3. David Eppstein wrote:

    > Is there an easy way of guessing with reasonable accuracy which of these
    > two encodings was used for a particular file?


    You could try the assumption that most characters should be letters,
    assuming your documents are likely text documents of some sort. The idea
    is that what is a letter in one code is some non-letter graphical symbol
    in the other.

    So you would create a predicate "isletter" for each character set, and
    then count the number of bytes in a document which are not letters. You
    should probably exclude the ASCII characters in counting, since they
    would have the same interpretation in either code page. The encoding
    that gives you fewer (or no) non-letter characters is likely the
    correct interpretation.

    To find out which bytes are letters, you could use unicodedata.category;
    the letter categories start with "L" ("Lu" for uppercase, "Ll" for
    lowercase, and so on). You should compute a bitmap for each character
    set up front, and find out how much the two bitmaps overlap.
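
    A sketch of that scheme in modern Python, using sets in place of
    literal bitmaps (the function names are illustrative):

    import unicodedata

    def letter_bytes(encoding):
        """High bytes (>= 0x80) that decode to characters whose Unicode
        category starts with "L" under the given encoding."""
        result = set()
        for b in range(0x80, 0x100):
            # cp1252 leaves a few bytes undefined; ignore those.
            ch = bytes([b]).decode(encoding, errors="ignore")
            if ch and unicodedata.category(ch).startswith("L"):
                result.add(b)
        return result

    CP850_LETTERS = letter_bytes("cp850")
    CP1252_LETTERS = letter_bytes("cp1252")

    # The overlap shows how ambiguous the two interpretations are.
    print(len(CP850_LETTERS & CP1252_LETTERS))

    def nonletter_count(data, letters):
        """Non-ASCII bytes that are not letters under this interpretation;
        ASCII is excluded since both code pages agree there."""
        return sum(1 for b in data if b >= 0x80 and b not in letters)

    def guess(data):
        """Prefer the interpretation that yields fewer non-letter bytes."""
        if nonletter_count(data, CP850_LETTERS) < nonletter_count(data, CP1252_LETTERS):
            return "cp850"
        return "cp1252"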

    To get higher accuracy, you need advance knowledge of the natural
    language your documents are in, and then you need a dictionary of
    that language.
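
    Such a refinement might look like this (the tiny word list is purely
    illustrative; in practice you would load a real dictionary for the
    documents' language):

    # Illustrative sample only; use a real word list in practice.
    FRENCH_SAMPLE = {"être", "déjà", "français", "où", "château"}

    def dictionary_score(data, encoding, words):
        """Count tokens of the decoded text that are known words."""
        text = data.decode(encoding, errors="replace").lower()
        return sum(1 for token in text.split() if token.strip(".,;:!?") in words)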

    HTH,
    Martin
    Martin v. Löwis, Nov 3, 2003
    #3
  4. In article <>,
    "John Roth" <> wrote:

    > > Is there an easy way of guessing with reasonable accuracy which of these
    > > two encodings was used for a particular file?

    >
    > The only way I know of is to do a statistical analysis of letter
    > frequencies. To do that, you have to know your data fairly well.
    > For example, CP850 devotes a number of code points to box-drawing
    > characters. If your data doesn't involve drawing boxes and you find
    > those characters in the input, that's a strong clue that you're
    > dealing with CP1252.


    Thanks. After trying some other, more hackish things that sort of
    worked (e.g., does the encoding produce characters with ord() > 255?),
    I settled on a very simple statistical scheme: count how many times
    each encoding produces characters that answer true to isalpha(), and
    let the higher count win. Seems to be working...
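
    That voting scheme fits in a few lines; a sketch in modern Python
    (the candidate tuple and the filename are illustrative):

    def guess_encoding(data, candidates=("cp850", "cp1252")):
        """Score each encoding by how many decoded characters are
        alphabetic; the highest score wins."""
        def votes(enc):
            try:
                return sum(ch.isalpha() for ch in data.decode(enc))
            except UnicodeDecodeError:
                return -1  # data contains bytes undefined in this encoding
        return max(candidates, key=votes)

    with open("example.dat", "rb") as f:  # illustrative filename
        print(guess_encoding(f.read()))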

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Nov 3, 2003
    #4
