unicode codecs

Discussion in 'Python' started by Ivan Voras, Feb 9, 2004.

  1. Ivan Voras

    Ivan Voras Guest

    When concatenating strings (actually, a constant and a string...) I get
    the following error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1:
    ordinal not in range(128)

    Now I don't think either string is unicode, but I'm working with
    win32api so it might be... :) The point is: I know all values will fit
    in a particular code page (iso-8859-2), so how do I change the 'ascii'
    codec in the above error into something that will work?
    Ivan Voras, Feb 9, 2004
    #1

  2. On Mon, 09 Feb 2004 21:59:36 +0100, Ivan Voras
    <ivoras@__geri.cc.fer.hr> wrote:

    >When concatenating strings (actually, a constant and a string...) I get
    >the following error:
    >
    >UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1:
    >ordinal not in range(128)
    >
    >Now I don't think either string is unicode, but I'm working with
    >win32api so it might be... :) The point is: I know all values will fit
    >in a particular code page (iso-8859-2), so how do I change the 'ascii'
    >codec in the above error into something that will work?


    To get a real solution, you should also post the offending code, but
    you might try to convert your values to unicode with the built-in
    unicode() and the string method decode(). See the library reference
    sections 2.1 and 2.2.6.
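
    A minimal sketch of the explicit decode suggested here, assuming the
    bytes really are in iso-8859-2. It is written for Python 3, where bytes
    plays the role of the 8-bit str in this thread and str plays the role of
    unicode; the sample value is only an illustration:

```python
# Example value only: the bytes an 8-bit API call might hand back.
raw = b"R\xfcgenwald.txt"

# Decode explicitly with the code page you know the data is in.
text = raw.decode("iso-8859-2")

# Both operands are text now, so no implicit conversion is attempted.
label = "x" + text
print(ascii(label))
```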

    --
    Christopher
    Christopher Koppler, Feb 9, 2004
    #2

  3. Ivan Voras

    Ivan Voras Guest

    Christopher Koppler wrote:

    > To get a real solution, you should also post the offending code, but
    > you might try to convert your values to unicode with the built-in
    > unicode() and the string method decode(). See the library reference
    > sections 2.1 and 2.2.6.


    I tried that, without luck. It is somewhat difficult to reproduce the
    problem, but here's how I see it:

    - win32api function returns a string (8bit) with some of the characters
    from the upper half of code page, let's call it s1
    - a statement such as a='x'+s1 fails with the above error.

    I don't really know why concatenation should check whether characters are
    7-bit clean (or indeed whether they represent anything in whatever code page).

    Since win32api functions exist also in unicode version, I tried this:

    - call the unicode version of function. Returned is a unicode string
    (checked, it really is unicode) like u'R\xfcgenwald.txt', let's call it s2
    - a statement a='x'+s2.encode('iso-8859-2') also fails with the exact
    same error.

    It is strange that if I execute similar code in Idle (e.g. manually
    assigning string constants to variables and concatenating), everything
    works!

    The exact error is:
    File "E:\develop\pynetdb\netdbcreate.py", line 32, in walkdirs
    fullname = root+'\\'+filename
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
    ordinal not in range(128)

    The filename variable contains (in my latest effort) utf-8 encoded value
    'R\xc3\xbcgenwald.mp3', and root variable contains a normal non-unicode
    string.

    I tried various combinations of unicode and non-unicode types, and they
    all fail sooner or later when they meet with a non-unicode string that
    is not 7-bit clean.
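
    The "byte 0xc3 in position 1" part of the message can be reproduced
    directly. This is a Python 3 sketch (bytes stands in for the 8-bit str
    used in this thread) that performs the ASCII decode by hand, which is
    roughly what the implicit conversion attempts:

```python
filename = b"R\xc3\xbcgenwald.mp3"   # the UTF-8 encoded value from the post

try:
    filename.decode("ascii")          # what an implicit ASCII conversion tries
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# failure.start is the index of the first byte the codec could not decode.
print(failure.encoding, failure.start, hex(filename[failure.start]))

# Decoding with the encoding the bytes were written in succeeds.
text = filename.decode("utf-8")
```

    So the error says nothing about the concatenation itself, only that some
    byte string was pushed through the ASCII codec.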
    Ivan Voras, Feb 9, 2004
    #3
  4. Ivan Voras wrote:
    > When concatenating strings (actually, a constant and a string...) I get
    > the following error:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1:
    > ordinal not in range(128)
    >
    > Now I don't think either string is unicode


    This statement must be false. When concatenating two byte strings, no
    codec is ever used. So, either
    1. one of the strings is a Unicode object, or
    2. you are not performing concatenation, or you get the exception
    from an operation that is not concatenation, or
    3. you are not getting this exception.

    Most likely, it is 1)
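
    A small sketch of that reasoning, assuming Python 3, where bytes
    corresponds to the byte string discussed here and str to the Unicode
    object. Python 3 does not attempt the implicit ASCII decode that Python 2
    did, so the mixed case shows up as a TypeError instead of a
    UnicodeDecodeError:

```python
left = b"x"
right = b"R\xfcgenwald.txt"   # 0xfc is not ASCII, which is irrelevant here

# Two byte strings: the bytes are simply copied, no codec is consulted.
joined = left + right
print(repr(joined))

# One text operand, one byte operand: a conversion would be required.
# Python 2 tried an implicit ASCII decode here; Python 3 refuses the mix.
try:
    mixed = "x" + right
except TypeError as exc:
    mixed = None
    print("mixed concatenation refused:", exc)
```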

    > The point is: I know all values will fit
    > in a particular code page (iso-8859-2), so how do I change the 'ascii'
    > codec in the above error into something that will work?


    Explicitly encode the Unicode string in your concatenation as
    iso-8859-2.

    Regards,
    Martin
    Martin v. Löwis, Feb 9, 2004
    #4
  5. Ivan Voras wrote:
    > - win32api function returns a string (8bit) with some of the characters
    > from the upper half of code page, let's call it s1


    Are you absolutely certain that type(s1) is str?

    > - a statement such as a='x'+s1 fails with the above error.


    Are you absolutely certain the constant is the literal string 'x'?

    > I don't really know why concatenation should check whether characters are
    > 7-bit clean (or indeed whether they represent anything in whatever code page).


    As you have shown, there would be no need, and indeed, Python will not
    check code pages in this case. So you must be doing something else.

    > - call the unicode version of function. Returned is a unicode string
    > (checked, it really is unicode) like u'R\xfcgenwald.txt', let's call it s2
    > - a statement a='x'+s2.encode('iso-8859-2') also fails with the exact
    > same error.


    How do you know it is the concatenation that causes the exception?

    > The exact error is:
    > File "E:\develop\pynetdb\netdbcreate.py", line 32, in walkdirs
    > fullname = root+'\\'+filename
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
    > ordinal not in range(128)
    >
    > The filename variable contains (in my latest effort) utf-8 encoded value
    > 'R\xc3\xbcgenwald.mp3', and root variable contains a normal non-unicode
    > string.


    Which string precisely (what is its repr())?
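
    One way to answer these questions is to log the type and repr() of each
    operand just before the failing line. A sketch in Python 3; describe() is
    a hypothetical helper and the root value is made up for illustration:

```python
def describe(name, value):
    """Return a one-line summary of a value's type and repr (hypothetical helper)."""
    return "%s: type=%s repr=%r" % (name, type(value).__name__, value)

root = "E:\\music"                   # made-up example value
filename = b"R\xc3\xbcgenwald.mp3"   # UTF-8 bytes, as quoted above

for line in (describe("root", root), describe("filename", filename)):
    print(line)
```

    If the two lines report different types, the concatenation is mixing
    text and bytes, which is the case where a codec gets involved.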

    Regards,
    Martin
    Martin v. Löwis, Feb 9, 2004
    #5
  6. Peter Otten

    Peter Otten Guest

    Ivan Voras wrote:

    > When concatenating strings (actually, a constant and a string...) I get
    > the following error:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1:
    > ordinal not in range(128)
    >
    > Now I don't think either string is unicode, but I'm working with
    > win32api so it might be... :) The point is: I know all values will fit
    > in a particular code page (iso-8859-2), so how do I change the 'ascii'
    > codec in the above error into something that will work?


    You can either convert all strings to unicode or to iso-8859-2.
    A hands on approach:

    >>> u,s
    (u'R\xfcbe', 'R\xfcbe')
    >>> u+s
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1:
    ordinal not in range(128)

    This error is prevented by an explicit conversion:

    >>> u.encode("iso-8859-1") + s
    'R\xfcbeR\xfcbe'

    or

    >>> u + s.decode("iso-8859-1")
    u'R\xfcbeR\xfcbe'

    If you aren't sure which string is unicode and which is not:

    >>> def toiso(s):
    ...     if isinstance(s, unicode):
    ...         return u.encode("iso-8859-1")
    ...     return s
    ...
    >>> toiso(u) + toiso(s)
    'R\xfcbeR\xfcbe'

    Peter
    Peter Otten, Feb 9, 2004
    #6
  7. Peter Otten

    Peter Otten Guest

    Peter Otten wrote:

    > >>> def toiso(s):
    > ...     if isinstance(s, unicode):
    > ...         return u.encode("iso-8859-1")
    > ...     return s
    > ...
    > >>> toiso(u) + toiso(s)
    > 'R\xfcbeR\xfcbe'


    Oops, that should be:

    >>> def toiso(t):
    ...     if isinstance(t, unicode):
    ...         return t.encode("iso-8859-1")
    ...     return t
    ...
    >>> toiso(u) + toiso(s)
    'R\xfcbeR\xfcbe'
    Peter Otten, Feb 9, 2004
    #7
  8. Ivan Voras

    Ivan Voras Guest

    Martin v. Löwis wrote:
    > Ivan Voras wrote:
    >
    >> - win32api function returns a string (8bit) with some of the
    >> characters from the upper half of code page, let's call it s1

    >
    >
    > Are you absolutely certain that type(s1) is str?


    Yes. Plain string.

    >> - a statement such as a='x'+s1 fails with the above error.

    >
    >
    > Are you absolutely certain the constant is the literal string 'x'?


    Um, what else could it be? This is an example, in the real case the
    literal string is something else (but of the same format).


    >> - call the unicode version of function. Returned is a unicode string
    >> (checked, it really is unicode) like u'R\xfcgenwald.txt', let's call
    >> it s2
    >> - a statement a='x'+s2.encode('iso-8859-2') also fails with the exact
    >> same error.

    >
    > How do you know it is the concatenation that causes the exception?


    What else could cause it? It's a simple command, nothing fancy - a
    concatenation and assignment.

    I've tried converting everything to use unicode and I'm getting *really*
    weird results now - it may be a bug in the win32api library.
    Ivan Voras, Feb 9, 2004
    #8
  9. Ivan Voras

    Ivan Voras Guest

    Peter Otten wrote:

    > You can either convert all strings to unicode or to iso-8859-2.
    > A hands on approach:
    >
    >
    > >>> u,s
    > (u'R\xfcbe', 'R\xfcbe')
    > >>> u+s
    > Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1:
    > ordinal not in range(128)
    >
    > This error is prevented by an explicit conversion:


    Thank you - I eventually found that out the hard way :) It was a mix of
    some bugs from my code and the win32api library code, and I was seeing
    exceptions pop up from both of them depending on what conditions were
    met. Eventually I seem to have found a workaround for the library bugs
    but I don't like it - it's a mixup of using unicode and code-page and
    converting around when necessary. The good thing is that it doesn't seem
    to influence performance a lot...

    (Apparently, win32file.FindFilesW does something with its parameter that
    breaks with above error when the parameter is unicode.)
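
    For what it's worth, the "converting around when necessary" part can be
    kept in one place by normalising every value to text at the point where
    it enters the program. A rough sketch in Python 3, assuming the data is
    in the iso-8859-2 code page mentioned in the first post; to_text() and
    full_name() are hypothetical helpers, the paths are made up, and no
    win32file call is involved:

```python
import ntpath

CODE_PAGE = "iso-8859-2"   # assumption: the code page named in the first post

def to_text(value, encoding=CODE_PAGE):
    """Return text, decoding byte strings with the given encoding (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def full_name(root, filename):
    """Join two path parts after normalising both of them to text."""
    return ntpath.join(to_text(root), to_text(filename))

# Made-up example: a text root and a byte-string file name.
print(ascii(full_name("E:\\music", b"R\xfcgenwald.mp3")))
```

    With every operand converted first, the join never has to guess an
    encoding, whichever variant of the API produced the value.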

    Thanks for the help, all!
    Ivan Voras, Feb 9, 2004
    #9
