Detect character encoding

Michal

Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect the
encoding and convert each file to utf-8 (with the string method encode).

Thank you for any answer.
Regards,
Michal
 
Scott David Daniels

Michal said:
Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect the
encoding and convert each file to utf-8 (with the string method encode).

Thank you for any answer.
Regards,
Michal
The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_
of languages.
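
For case (1), the conversion itself is the easy part. A minimal sketch
(the file names and the charset here are only examples):

    # Case (1): the charset is known ahead of time, so just re-encode.
    data = open("input.txt", "rb").read()
    text = data.decode("iso-8859-2")        # the known source charset
    open("output.txt", "wb").write(text.encode("utf-8"))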

--Scott David Daniels
 
Diez B. Roggisch

Michal said:
Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect the
encoding and convert each file to utf-8 (with the string method encode).

You can only guess, e.g. by looking for words that contain umlauts.
Recode might be of help here; it has such heuristics built in, AFAIK.

But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
file is "legal" in all encodings.


Diez
 
Mike Meyer

Diez B. Roggisch said:
But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.
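
That elimination step is easy to write down (the candidate list is
arbitrary, and possible_encodings is my name for it):

    # Drop every candidate encoding that rejects some byte of the data.
    def possible_encodings(data, candidates):
        ok = []
        for enc in candidates:
            try:
                data.decode(enc)
            except UnicodeDecodeError:
                continue            # a byte undefined in enc: eliminate
            ok.append(enc)
        return ok

    # "\x81" is undefined in cp1250, so only the other two survive:
    print possible_encodings("\x81", ["cp1250", "latin1", "iso-8859-2"])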

<mike
 
Nemesis

While I was thinking up a nice intro, "Michal" wrote:
Hello,
is there any way to detect a string's encoding in Python?
I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect the
encoding and convert each file to utf-8 (with the string method encode).
Thank you for any answer.

Hi,
As you have already heard, you can't be sure, but you can guess.

I use a method like this:

def guess_encoding(text):
    for best_enc in guess_list:
        try:
            unicode(text, best_enc, "strict")
        except UnicodeDecodeError:
            pass                    # not this charset; try the next one
        else:
            return best_enc
    raise ValueError("none of the charsets in guess_list fit")

'guess_list' is an ordered charset name list like this:

['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

Of course you can remove charsets you are sure you'll never encounter.
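
A usage sketch (the list is only an example): the ordering matters,
because a charset like iso-8859-1 accepts every byte and stops the
search as soon as it is reached.

    # Put strict charsets first; iso-8859-1 decodes any byte string,
    # so everything listed after it would never be tried.
    guess_list = ["us-ascii", "utf-8", "iso-8859-2", "windows-1250",
                  "iso-8859-1"]

    print guess_encoding("plain text")      # -> us-ascii
    print guess_encoding("\xc4\x9b")        # valid utf-8 pair -> utf-8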
 
Martin P. Hellwig

Mike said:
Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

<mike

I read or heard (I can't remember the origin) that MS IE has quite a
good implementation for guessing the language and character encoding of
web pages when they are not specified, or are specified incorrectly.
From what I can remember, they used an algorithm to gather statistics
about the specific page, compared them with statistics about all kinds
of languages and encodings, and simply picked the most likely match.

Please be aware that I don't know if the above has even the slightest
amount of truth in it; however, that didn't prevent me from posting
anyway ;-)
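
That statistical idea is easy to sketch, even if it is surely nothing
like IE's real implementation (the scoring rule below is invented for
illustration):

    # Toy statistical guesser: score each candidate charset by how
    # "ordinary" the decoded text looks (fraction of letters and
    # whitespace) and keep the best-scoring one.
    def score(u):
        good = sum(1 for ch in u if ch.isalpha() or ch.isspace())
        return float(good) / max(len(u), 1)

    def statistical_guess(data, candidates):
        best_enc, best_score = None, -1.0
        for enc in candidates:
            try:
                u = data.decode(enc)
            except UnicodeDecodeError:
                continue
            s = score(u)
            if s > best_score:
                best_enc, best_score = enc, s
        return best_enc

A real implementation would compare letter and letter-pair frequencies
against per-language tables instead of a crude letter count.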
 
skip

Martin> I read or heard (I can't remember the origin) that MS IE has
Martin> quite a good implementation for guessing the language and
Martin> character encoding of web pages when they are not specified,
Martin> or are specified incorrectly.

Gee, that's nice. Too bad the source isn't available... <0.5 wink>

Skip
 
Diez B. Roggisch

Mike said:
Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.


----- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps
to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all code points defined - but the others do.
Sure, this helps you eliminate one of the three choices the OP wanted
to choose between - but how many texts do you have that contain a
byte 129?

Regards,

Diez
 
François Pinard

[Diez B. Roggisch]
Recode might be of help here; it has such heuristics built in, AFAIK.

If we are speaking about the same Recode ☺, there are some built-in
tools that could help a human discover a charset, but this requires
work and time, and is far from being as fully automated as one might
dream. While some charsets can be guessed almost correctly by automatic
means, most are difficult to recognise. The whole problem is not easy.
 
Martin v. Löwis

Martin said:
From what I can remember, they used an algorithm to gather statistics
about the specific page, compared them with statistics about all kinds
of languages and encodings, and simply picked the most likely match.

More hearsay: I believe language-based heuristics are common. You first
guess an encoding based on the bytes you see, then guess the language of
the page. If you then get a lot of characters that should not appear
in texts of that language (e.g. a lot of umlaut characters in a French
page), you know your guess was wrong, and you try a different language
for that encoding. If you run out of languages, you guess a different
encoding.

Mozilla can guess the encoding if you tell it what the language is,
which sounds like a similar approach.
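
A toy version of that backtracking loop might look like this (the
"unlikely characters" tables and the 1% threshold are invented for
illustration):

    # Reject an (encoding, language) pair when characters implausible
    # for that language are suspiciously frequent.
    UNLIKELY = {
        "french": u"\xe4\xf6\xfc\xdf",  # a/o/u-umlaut, sharp s
        "german": u"\xf9\xee\xe7",      # u-grave, i-circumflex, c-cedilla
    }

    def plausible(u, lang):
        bad = sum(u.count(ch) for ch in UNLIKELY[lang])
        return bad <= 0.01 * len(u)     # threshold chosen arbitrarily

    def guess_enc_and_lang(data, encodings, languages):
        for enc in encodings:
            try:
                u = data.decode(enc)
            except UnicodeDecodeError:
                continue                # undecodable: next encoding
            for lang in languages:
                if plausible(u, lang):
                    return enc, lang
            # ran out of languages: fall through to the next encoding
        return None, None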

Regards,
Martin
 
Martin v. Löwis

Diez said:
So cp1250 doesn't have all code points defined - but the others do.
Sure, this helps you eliminate one of the three choices the OP wanted
to choose between - but how many texts do you have that contain a
byte 129?

For the iso-8859 ones, you should assume that the characters in
range(128, 160) really aren't used. If you get one of these, and it is
not utf-8, it is a Windows code page.

UTF-8 can be recognized pretty reliably: even though it allows all bytes
to appear, it is very constrained in what sequences of bytes it allows.
E.g. you can't have a single byte >127 in UTF-8; you need at least two
of them in a row, and they need to meet further constraints.
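
Both observations turn into small checks (the function names are mine):

    # A plain decode attempt is a reliable UTF-8 probe: random 8-bit
    # text almost never happens to satisfy UTF-8's sequence rules.
    def looks_like_utf8(data):
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    # Bytes 128-159 are control characters in the iso-8859 family but
    # printable in the Windows code pages, so they hint at cp125x.
    def windows_codepage_hint(data):
        for ch in data:
            if 128 <= ord(ch) < 160:
                return True
        return False

    print looks_like_utf8("\xc3\xa9")   # True: valid two-byte sequence
    print looks_like_utf8("\xe9")       # False: lone byte > 127
    print windows_codepage_hint("\x9a") # True: 0x9a is printable in cp1250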

Regards,
Martin
 
Kent Johnson

Martin said:
I read or heard (I can't remember the origin) that MS IE has quite a
good implementation for guessing the language and character encoding of
web pages when they are not specified, or are specified incorrectly.

Yes, I think that's right. In my experience MS Word does a very good job
of guessing the encoding of text files.

Kent
 
The new guy

Michal said:
Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect the
encoding and convert each file to utf-8 (with the string method encode).

Well, I can't help with how to detect it in Python. My first guess,
though, would be to have a look at the source code of the "file"
utility. This is an example of what it does:

# ls
de.i18n en.i18n
# file *
de.i18n: ISO-8859 text, with very long lines
en.i18n: ISO-8859 English text, with very long lines
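
If file(1)'s guess is good enough, you can also call it from a script
(this assumes a Unix-like system with "file" on the PATH):

    # Delegate the guessing to the "file" utility.
    import os

    guess = os.popen("file -b de.i18n").read().strip()
    print guess     # e.g. "ISO-8859 text, with very long lines"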

cheers
 
