Python3: Sane way to deal with broken encodings

Bruno Desthuilliers · Dec 6, 2009

Johannes Bauer wrote:

Dear all,

I have some applications which fetch HTML documents off the web, parse
their content and do stuff with it. Every once in a while it happens
that the web site administrators put up files which are encoded in a
wrong manner.

Thus my Python script dies a horrible death:

  File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte

This is usually fine, but I'd like to be able to tell Python:
"Don't worry, some idiot encoded that file, just skip over such
parts/replace them with some character sequence".

Is that possible? If so, how?

This might get you started:

"""decode(...)
S.decode([encoding[,errors]]) -> object

Decodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
as well as any other name registered with codecs.register_error that is
able to handle UnicodeDecodeErrors.
"""

HTH
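
The docstring above mentions 'ignore', 'replace' and handlers registered with codecs.register_error. A minimal sketch of all three follows, assuming Python 3, where decode() is a method of bytes objects rather than str. The handler name "hexmarker" and the function hex_marker are made up for illustration and are not part of the thread.

```python
import codecs


def hex_marker(exc):
    """Hypothetical error handler: show each undecodable byte as <0xNN>."""
    # Only decoding errors are handled here; anything else is re-raised.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    marker = "".join("<0x{:02x}>".format(b) for b in bad)
    # Return the replacement text and the position at which to resume decoding.
    return marker, exc.end


# Register the handler under an illustrative name.
codecs.register_error("hexmarker", hex_marker)

raw = b"It\x92s broken"  # 0x92 is not valid UTF-8 at this position

replaced = raw.decode("utf-8", errors="replace")   # 'It\ufffds broken'
ignored = raw.decode("utf-8", errors="ignore")     # 'Its broken'
marked = raw.decode("utf-8", errors="hexmarker")   # 'It<0x92>s broken'

print(repr(replaced), repr(ignored), repr(marked))
```

A custom handler like this can be useful when the position of the damage matters, since 'ignore' drops the byte silently.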

Johannes Bauer · Dec 6, 2009

Dear all,

I have some applications which fetch HTML documents off the web, parse
their content and do stuff with it. Every once in a while it happens
that the web site administrators put up files which are encoded in a
wrong manner.

Thus my Python script dies a horrible death:

  File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte

This is usually fine, but I'd like to be able to tell Python:
"Don't worry, some idiot encoded that file, just skip over such
parts/replace them with some character sequence".

Is that possible? If so, how?

Kind regards,
Johannes

--
"Strong potentials can result in strong earthquakes; but small ones
can also arise - and "you" won't believe it possible (!), but look:
it is also possible that no earthquakes result at all."
-- "Rüdiger Thomas" alias Thomas Schulz in dsa, on his "predictions"
<1a30da36-68a2-4977-9eed-154265b17d28@q14g2000vbi.googlegroups.com>

Johannes Bauer · Dec 7, 2009

Bruno said:
Is that possible? If so, how?

This might get you started:

"""decode(...)
S.decode([encoding[,errors]]) -> object

Hmm, this would work nicely if I called "decode" explicitly - but what
I'm doing is:

#!/usr/bin/python3
for line in open("broken", "r"):
    pass

Which still raises the UnicodeDecodeError when I do not even do any
decoding explicitly. How can I achieve this?

Kind regards,
Johannes


Benjamin Kaplan · Dec 7, 2009

Bruno said:

Is that possible? If so, how?

This might get you started:

help(str.decode)

decode(...)
S.decode([encoding[,errors]]) -> object

Hmm, this would work nicely if I called "decode" explicitly - but what
I'm doing is:

#!/usr/bin/python3
for line in open("broken", "r"):
    pass

Which still raises the UnicodeDecodeError when I do not even do any
decoding explicitly. How can I achieve this?

Kind regards,
Johannes

Looking at the Python 3 docs, it seems that open() takes encoding
and errors as optional arguments. So you can call
open('broken', 'r', errors='replace')
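
A small self-contained sketch of that suggestion. The file contents are invented for the demonstration, and encoding="utf-8" is passed explicitly as an assumption; the original script relied on the platform default, which the traceback shows was UTF-8.

```python
import os
import tempfile

# Create a throwaway file containing the stray byte 0x92 from the traceback.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\nIt\x92s broken\nlast line\n")

lines = []
# errors='replace' turns each undecodable byte into U+FFFD instead of raising.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    for line in f:
        lines.append(line)

os.remove(path)
print(lines)
```

The loop now runs to the end of the file, and the damaged spot shows up as the Unicode replacement character U+FFFD.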

Martin v. Loewis · Dec 8, 2009

Thus my Python script dies a horrible death:

  File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte

This is usually fine, but I'd like to be able to tell Python:
"Don't worry, some idiot encoded that file, just skip over such
parts/replace them with some character sequence".

Is that possible? If so, how?

As Benjamin says: if you pass errors='replace' to open, then it will
replace the faulty characters; if you pass errors='ignore', it will
skip over them.
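
To make the difference between the two handlers concrete, here is a rough side-by-side sketch. The helper read_with and the sample bytes are hypothetical, and UTF-8 is assumed as the file encoding.

```python
import os
import tempfile


def read_with(errors, data=b"It\x92s broken\n"):
    """Hypothetical helper: write data to a temp file, read it back as UTF-8."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, "r", encoding="utf-8", errors=errors) as f:
            return f.read()
    finally:
        os.remove(path)


replaced = read_with("replace")  # bad byte becomes U+FFFD
ignored = read_with("ignore")    # bad byte is dropped

print(repr(replaced))
print(repr(ignored))
```

With 'replace' the text keeps a visible marker where data was lost; with 'ignore' the result is one character shorter and nothing indicates that anything was dropped.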

Alternatively, you can open the files in binary ('rb'), so that no
decoding will be attempted at all, or you can specify latin-1 as
the encoding, which means that you can decode all files successfully
(though possibly not correctly).
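
A sketch of these two alternatives, again with invented sample bytes. The final cp1252 decode is not from the thread: byte 0x92 is the right single quotation mark in Windows-1252, so that encoding is a plausible guess for such files, but it remains an assumption.

```python
import os
import tempfile

data = b"It\x92s broken\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Binary mode: no decoding is attempted, so you get the raw bytes back.
with open(path, "rb") as f:
    raw = f.read()

# latin-1 maps every byte to a code point, so decoding cannot fail,
# but 0x92 becomes the control character U+0092, which is probably not
# what the author of the page meant.
with open(path, "r", encoding="latin-1") as f:
    as_latin1 = f.read()

os.remove(path)

# Assumption: the file was really written as Windows-1252.
as_cp1252 = raw.decode("cp1252")

print(repr(raw), repr(as_latin1), repr(as_cp1252))
```

This illustrates the "successfully, though possibly not correctly" caveat: latin-1 never raises, yet only the cp1252 guess yields a readable apostrophe here.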

Regards,
Martin
