Python3: Sane way to deal with broken encodings

Discussion in 'Python' started by Bruno Desthuilliers, Dec 6, 2009.

  1. Johannes Bauer wrote:
    > Dear all,
    >
    > I've some applications which fetch HTML documents off the web, parse
    > their content and do stuff with it. Every once in a while it happens
    > that the web site administrators put up files which are encoded in a
    > wrong manner.
    >
    > Thus my Python script dies a horrible death:
    >
    > File "./update_db", line 67, in <module>
    > for line in open(tempfile, "r"):
    > File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
    > 3286: unexpected code byte
    >
    > This is well and ok usually, but I'd like to be able to tell Python:
    > "Don't worry, some idiot encoded that file, just skip over such
    > parts/replace them by some character sequence".
    >
    > Is that possible? If so, how?


    This might get you started:

    """
    >>> help(str.decode)

    decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
    """

    HTH
     
    Bruno Desthuilliers, Dec 6, 2009
    #1

  2. Dear all,

    I've some applications which fetch HTML documents off the web, parse
    their content and do stuff with it. Every once in a while it happens
    that the web site administrators put up files which are encoded in a
    wrong manner.

    Thus my Python script dies a horrible death:

    File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
    File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
    3286: unexpected code byte

    This is well and ok usually, but I'd like to be able to tell Python:
    "Don't worry, some idiot encoded that file, just skip over such
    parts/replace them by some character sequence".

    Is that possible? If so, how?

    Kind regards,
    Johannes

    --
    "Aus starken Potentialen können starke Erdbeben resultieren; es können
    aber auch kleine entstehen - und "du" wirst es nicht für möglich halten
    (!), doch sieh': Es können dabei auch gar keine Erdbeben resultieren."
    -- "Rüdiger Thomas" alias Thomas Schulz in dsa über seine "Vorhersagen"
    <>
     
    Johannes Bauer, Dec 6, 2009
    #2

  3. Bruno Desthuilliers wrote:

    >> Is that possible? If so, how?
    >
    > This might get you started:
    >
    > """
    > >>> help(str.decode)
    >
    > decode(...)
    > S.decode([encoding[,errors]]) -> object


    Hmm, this would work nicely if I called "decode" explicitly - but what
    I'm doing is:

    #!/usr/bin/python3
    for line in open("broken", "r"):
        pass

    Which still raises the UnicodeDecodeError when I do not even do any
    decoding explicitly. How can I achieve this?

    Kind regards,
    Johannes

     
    Johannes Bauer, Dec 7, 2009
    #3
  4. On Mon, Dec 7, 2009 at 2:16 PM, Johannes Bauer wrote:
    > Bruno Desthuilliers wrote:
    >
    >>> Is that possible? If so, how?
    >>
    >> This might get you started:
    >>
    >> """
    >> >>> help(str.decode)
    >>
    >> decode(...)
    >>     S.decode([encoding[,errors]]) -> object

    >
    > Hmm, this would work nicely if I called "decode" explicitly - but what
    > I'm doing is:
    >
    > #!/usr/bin/python3
    > for line in open("broken", "r"):
    >        pass
    >
    > Which still raises the UnicodeDecodeError when I do not even do any
    > decoding explicitly. How can I achieve this?
    >
    > Kind regards,
    > Johannes
    >


    Looking at the Python 3 docs, it seems that open() takes encoding
    and errors as optional arguments. So you can call
    open('broken', 'r', errors='replace')
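
Applied to the loop from the question, that could look roughly like the sketch below. The temporary file and its contents are made up for illustration, and encoding="utf-8" is passed explicitly on the assumption that the files are meant to be UTF-8; without it, open() uses the platform default.

```python
import os
import tempfile

# Create a small file containing a byte (0x92) that is invalid in UTF-8.
# The file and its contents are invented for this illustration.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\nit\x92s broken\n")

# errors='replace' turns each undecodable byte into U+FFFD instead of
# raising UnicodeDecodeError while iterating over the file.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(path)
print(lines)
```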

     
    Benjamin Kaplan, Dec 7, 2009
    #4
  5. > Thus my Python script dies a horrible death:
    >
    > File "./update_db", line 67, in <module>
    > for line in open(tempfile, "r"):
    > File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position
    > 3286: unexpected code byte
    >
    > This is well and ok usually, but I'd like to be able to tell Python:
    > "Don't worry, some idiot encoded that file, just skip over such
    > parts/replace them by some character sequence".
    >
    > Is that possible? If so, how?


    As Benjamin says: if you pass errors='replace' to open, then it will
    replace the faulty characters; if you pass errors='ignore', it will
    skip over them.

    Alternatively, you can open the files in binary ('rb'), so that no
    decoding will be attempted at all, or you can specify latin-1 as
    the encoding, which means that you can decode all files successfully
    (though possibly not correctly).
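
A small sketch of the two alternatives, using an invented temporary file. The last step assumes the file was really written as Windows-1252, which is only a guess based on the 0x92 byte in the traceback (0x92 is a right single quotation mark in that code page); the original posts do not say what the real encoding was.

```python
import os
import tempfile

# Invented sample file containing the byte 0x92.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"it\x92s broken\n")

# Binary mode: lines come back as bytes and no decoding is attempted.
with open(path, "rb") as f:
    raw_lines = list(f)

# latin-1 maps every byte 0x00-0xFF to the code point of the same value,
# so decoding never fails, but 0x92 becomes the control character U+0092.
with open(path, "r", encoding="latin-1") as f:
    latin1_text = f.read()

# Assumption: the data is actually Windows-1252, where 0x92 is U+2019.
cp1252_text = raw_lines[0].decode("cp1252")

os.remove(path)
print(raw_lines, repr(latin1_text), repr(cp1252_text))
```

This shows what "successfully though possibly not correctly" means in practice: latin-1 yields a string without complaint, but the character in it may not be the one the author typed.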

    Regards,
    Martin
     
    Martin v. Loewis, Dec 8, 2009
    #5
