Re: catch UnicodeDecodeError

Discussion in 'Python' started by Jaroslav Dobrek, Jul 26, 2012.

  1. On Jul 26, 12:19 pm, wrote:
    > On Thursday, July 26, 2012 9:46:27 AM UTC+2, Jaroslav Dobrek wrote:
    > > On Jul 25, 8:50 pm, Dave Angel <> wrote:
    > > > On 07/25/2012 08:09 AM, wrote:
    > > >
    > > > > On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
    > > > >> Hi Jaroslav,
    > > >
    > > > >> you can catch a UnicodeDecodeError just like any other exception. Can
    > > > >> you provide a full example program that shows your problem?
    > > >
    > > > >> This works fine on my system:
    > > >
    > > > >> import sys
    > > > >> open('tmp', 'wb').write(b'\xff\xff')
    > > > >> try:
    > > > >>     buf = open('tmp', 'rb').read()
    > > > >>     buf.decode('utf-8')
    > > > >> except UnicodeDecodeError as ude:
    > > > >>     sys.exit("Found a bad char in file " + "tmp")
    > > >
    > > > > Thank you. I got it. What I need to do is explicitly decode text.
    > > >
    > > > > But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
    > > >
    > > > > What I am missing (especially for Python3) is something like:
    > > >
    > > > > try:
    > > > >     for line in sys.stdin:
    > > > >         process(line)  # whatever the program does per line
    > > > > except UnicodeDecodeError:
    > > > >     sys.exit("Encoding problem in line " + str(line_number))
    > > >
    > > > > I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
    > > >
    > > > I can't understand your question.  If the problem is that the system
    > > > doesn't magically produce a variable called line_number, then generate
    > > > it yourself, by counting in the loop.
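A minimal sketch of Dave's suggestion, for what it's worth: iterate over the raw byte stream, decode each line yourself, and count as you go, so the line number is at hand when decoding fails. The function name and the exit message are made up for illustration.

```python
import sys

def decode_lines(stream, encoding="utf-8"):
    """Yield decoded lines from a byte stream, tracking the line number.

    Sketch only: assumes *stream* yields bytes objects, one per line.
    """
    for line_number, raw in enumerate(stream, start=1):
        try:
            yield raw.decode(encoding)
        except UnicodeDecodeError:
            sys.exit("Encoding problem in line " + str(line_number))
```

In Python 3, `sys.stdin.buffer` exposes the underlying byte stream, so the original loop shape is preserved as `for line in decode_lines(sys.stdin.buffer): ...`.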

    >
    > > That was just a very incomplete and general example.

    >
    > > My problem is solved. What I need to do is explicitly decode text when
    > > reading it. Then I can catch exceptions. I might do this in future
    > > programs.

    >
    > > What I dislike about this solution is that it complicates most
    > > programs unnecessarily. In programs that open, read and process many
    > > files I don't want to explicitly decode and encode characters all
    > > the time. I just want to write:

    >
    > > for line in f:

    >
    > > or something like that. Yet, writing this means to *implicitly* decode
    > > text. And, because the decoding is implicit, you cannot say

    >
    > > try:
    > >     for line in f: # here text is decoded implicitly
    > >        do_something()
    > > except UnicodeDecodeError:
    > >     do_something_different()

    >
    > > This doesn't really help, because the except clause cannot resume
    > > the loop, so one bad character still aborts the whole file.

    >
    > > The problem is that the vast majority of the thousands of files that I
    > > process are correctly encoded. But then, suddenly, there is a bad
    > > character in a new file. (This is so because most files today are
    > > generated by people who don't know that there is such a thing as
    > > encodings.) And then I need to rewrite my very complex program just
    > > because of one single character in one single file.

    >
    > In my mind you are taking the problem the wrong way.
    >
    > Basically there is no "real UnicodeDecodeError"; you are
    > just attempting to read a file with the wrong
    > codec. Catching a UnicodeDecodeError will not correct
    > the basic problem, it will "only" show that you are using
    > the wrong codec.
    > There is still the possibility that you have to deal with
    > ill-formed utf-8 coding, but I doubt it is the case.



    I participate in projects in which all files (raw text files, xml
    files, html files, ...) are supposed to be encoded in utf-8. I get
    many different files from many different people. They are almost
    always encoded in utf-8. But sometimes a whole file or, more
    frequently, parts of a file are not encoded in utf-8. The reason is
    that most of the files stem from the internet. Files or strings are
    downloaded and, if possible, recoded. And they are often simply
    concatenated into larger strings or files.

    I think the most straightforward thing to do is to assume that I get
    utf-8 and raise an error if some file or character proves to be
    something different.
    Jaroslav Dobrek, Jul 26, 2012
    #1
