Distinguishing cp850 and cp1252?


David Eppstein

I'm working on some Python code for reading files in a certain format,
and the examples of such files I've found on the internet appear to be
in either cp850 or cp1252 encoding (except for one exception for which I
can't find a correct encoding among the standard Python ones).

The file format itself includes nothing about which encoding is used,
but only one of the two produces sensible results in the non-ascii
examples I've seen.

Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?
 

John Roth

David Eppstein said:
I'm working on some Python code for reading files in a certain format,
and the examples of such files I've found on the internet appear to be
in either cp850 or cp1252 encoding (except for one exception for which I
can't find a correct encoding among the standard Python ones).

The file format itself includes nothing about which encoding is used,
but only one of the two produces sensible results in the non-ascii
examples I've seen.

Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?

The only way I know of is to do a statistical analysis on letter
frequencies. To do that, you have to know your data fairly well.
For example, CP850 devotes a number of code points to box-drawing
characters. If your data doesn't involve drawing boxes and you find
those characters in the input, I'd say that's a strong clue that
you're dealing with CP1252.

I know this doesn't help all that much, but it's the only thing
that has worked for me.
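
A minimal sketch of that clue, in present-day Python 3 with names of my
own choosing: decode the raw bytes as cp850 and look for characters in
the Unicode box-drawing block (U+2500 through U+257F).

    def has_box_drawing(data):
        # cp850 defines all 256 byte values, so this decode cannot fail.
        decoded = data.decode("cp850")
        # Box-drawing characters occupy U+2500..U+257F.
        return any(0x2500 <= ord(ch) <= 0x257F for ch in decoded)

If has_box_drawing(data) is true and the files are not known to contain
line art, cp1252 is the more plausible reading.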

John Roth
 

Martin v. Löwis

David said:
Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?

You could try the assumption that most characters should be letters,
assuming your documents are likely text documents of some sort. The idea
is that what is a letter in one code is some non-letter graphical symbol
in the other.

So you would create a predicate "isletter" for each character set, and
then count the number of bytes in a document which are not letters. You
should probably exclude the ASCII characters in counting, since they
would have the same interpretation in either code. The code that gives
you fewer (or no) non-letter characters is likely the correct
interpretation.

To find out which bytes are letters, you could use unicodedata.category;
the letter categories start with "L" ("Lu" for uppercase, "Ll" for
lowercase, plus a few others such as "Lo"). You should compute a bitmap
for each character set up front, and you should find out what the
overlap in set bits is.
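
Something along these lines (a sketch in present-day Python 3, using
sets rather than literal bitmaps, with function names of my own
choosing):

    import unicodedata

    def letter_bytes(encoding):
        # Non-ASCII byte values that decode to a letter in this code page.
        letters = set()
        for b in range(128, 256):
            try:
                ch = bytes([b]).decode(encoding)
            except UnicodeDecodeError:
                continue  # byte undefined in this code page (e.g. 0x81 in cp1252)
            if unicodedata.category(ch).startswith("L"):
                letters.add(b)
        return letters

    CP850_LETTERS = letter_bytes("cp850")
    CP1252_LETTERS = letter_bytes("cp1252")

    def non_letter_count(data, letters):
        # Count non-ASCII bytes that are not letters under the given code page.
        return sum(1 for b in data if b >= 128 and b not in letters)

    def guess_codepage(data):
        # The interpretation with fewer non-letter high bytes wins.
        if non_letter_count(data, CP850_LETTERS) <= non_letter_count(data, CP1252_LETTERS):
            return "cp850"
        return "cp1252"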

To get a higher accuracy, you need advance knowledge on the natural
language your documents are in, and then you need to use a dictionary
of that language.

HTH,
Martin
 

David Eppstein

Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?

The only way I know of is to do a statistical analysis on letter
frequencies. To do that, you have to know your data fairly well.
For example, CP850 devotes a number of code points to box-drawing
characters. If your data doesn't involve drawing boxes and you find
those characters in the input, I'd say that's a strong clue that
you're dealing with CP1252.

Thanks. After trying some other, more hackish things that sort of
worked (e.g. does the encoding produce characters with ord() > 255?),
I settled on a very simple statistical scheme: each encoding gets a
vote for every decoded character that answers True to isalpha().
Seems to be working...
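
Roughly, the voting scheme boils down to something like this (a sketch
with a function name of my own, not the actual code):

    def guess_encoding(data, candidates=("cp850", "cp1252")):
        def votes(encoding):
            try:
                text = data.decode(encoding)
            except UnicodeDecodeError:
                return -1  # undecodable bytes disqualify this code page
            # One vote per decoded character that is alphabetic.
            return sum(1 for ch in text if ch.isalpha())
        return max(candidates, key=votes)

    # Example: byte 0x82 is 'é' in cp850 but a low quotation mark in cp1252,
    # so guess_encoding(b"caf\x82") returns "cp850".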
 
