Strange problems with encoding

Discussion in 'Python' started by Sebastian Meyer, Nov 6, 2003.

  1. Hi newsgroup,

    I am trying to replace German special characters in strings like
    str = re.sub('ö', 'oe', str)

    When I work with this, I always get the message
    UniCode Error: ASCII decoding error : ordinal not in range(128)

    Yes, I have googled, I searched the FAQ, manual and Python library and
    searched all known sources of information. I played with the Python
    builtin function encode to enforce the right encoding, but the error
    stays the same. I've read a lot about Unicode and the internal conversion
    of strings done by Python, but somehow I've missed the clue.
    Nope, Python says Huuups... ordinal not in range(128), ;-(

    Does anyone have any idea? Seems like I am too stupid to read the
    documentation carefully, or perhaps I misunderstand something...

    thanks for your help in advance

    Sebastian
     
    Sebastian Meyer, Nov 6, 2003
    #1

  2. Sebastian Meyer wrote:

    > Hi newsgroup,
    >
    > i am trying to replace german special characters in strings like
    > str = re.sub('ö', 'oe', str)
    >
    > When i work with this, i always get the message
    > UniCode Error: ASCII decoding error : ordinal not in range(128)
    >
    > Yes i have googled, i searched the faq, manual and python library and
    > searched all known soruces of information. I played with the python
    > builtin function encode to enforce the rigth encoding, but the error
    > stays the same. I ve read a lot about UniCode and internal conversion
    > about Strings done by python, but somehow i ve missed the clue.
    > Nope, python says Huuups... ordinal not in range(128), ;-(
    >
    > Anyone of you having any idea?? Seems like i am too stupid to read
    > documentation carefully., perhaps i misunderstand something...
    >
    > thanks for your help in advance
    >
    > Sebastian


    I'm experiencing something similar at the moment. I try to
    base64-encode Unicode strings and I get the exact same error message.

    >>> s = u'ö'
    >>> s
    u'\xf6'
    >>> s.encode('base64')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
      File "C:\Python23\lib\encodings\base64_codec.py", line 24, in base64_encode
        output = base64.encodestring(input)
      File "C:\Python23\lib\base64.py", line 39, in encodestring
        pieces.append(binascii.b2a_base64(chunk))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

    When I don't specify it's unicode it works:
    >>> s = 'ö'
    >>> s
    '\xf6'
    >>> s.encode('base64')
    '9g==\n'

    The reason I want to base64-encode these unicode strings is because I
    get those as input and want to store them in a MySQL database using
    SQLObject.
     
    Rudy Schockaert, Nov 6, 2003
    #2

  3. "Sebastian Meyer" <> writes:

    > Hi newsgroup,
    >
    > i am trying to replace german special characters in strings like
    > str = re.sub('ö', 'oe', str)


    1) str is the name of a builtin -- often a bad idea to use that as a
    variable name.

    2) I presume `str' is a unicode string? Try writing the literal as
    u'ö' instead (and adding the appropriate coding cookie to your
    source file if using Python 2.3). Or I guess you could write it

    u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'

    Cheers,
    mwh

    --
    Usenet is like a herd of performing elephants with diarrhea --
    massive, difficult to redirect, awe-inspiring, entertaining, and
    a source of mind-boggling amounts of excrement when you least
    expect it. -- spaf (1992)
     
    Michael Hudson, Nov 6, 2003
    #3
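
The point about writing the pattern as a Unicode literal can be checked with a minimal sketch. It assumes a present-day Python 3 interpreter rather than the Python 2.2/2.3 used in this thread, so every str literal is already Unicode and neither the u prefix nor a coding cookie is needed. The variable names are illustrative only.

```python
import re

# Two spellings of the same one-character string (U+00F6).
named = '\N{LATIN SMALL LETTER O WITH DIAERESIS}'
escaped = '\xf6'

# Pattern, replacement and subject are all text, so no implicit
# ASCII conversion is attempted anywhere.
text = 'D\xf6spaddel'
result = re.sub(named, 'oe', text)
print(result)
```

Under these assumptions the named escape is just a more readable way to write the character; it does not change what is matched.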
  4. Rudy Schockaert <> writes:

    > Sebastian Meyer wrote:
    >
    > > Hi newsgroup,
    > > i am trying to replace german special characters in strings like
    > > str = re.sub('ö', 'oe', str)
    > > When i work with this, i always get the message
    > > UniCode Error: ASCII decoding error : ordinal not in range(128)
    > > Yes i have googled, i searched the faq, manual and python library
    > > and
    > > searched all known soruces of information. I played with the python
    > > builtin function encode to enforce the rigth encoding, but the error
    > > stays the same. I ve read a lot about UniCode and internal conversion
    > > about Strings done by python, but somehow i ve missed the clue.
    > > Nope, python says Huuups... ordinal not in range(128), ;-(
    > > Anyone of you having any idea?? Seems like i am too stupid to read
    > > documentation carefully., perhaps i misunderstand something...
    > > thanks for your help in advance
    > > Sebastian

    >
    > I'm experiencing something similar for the moment. I try to
    > base64-encode Unicode strings and I get the exact same errormessage.


    "base64-encoding Unicode strings" is not a particularly well defined
    operation. "base64-encoding" is a way of turning *binary data* into a
    particularly "safe" sequence of ascii characters.

    Unicode (in some sense) is a family of ways of representing strings of
    characters as binary data.

    So to base-64 encode a Unicode string, you need to choose *which*
    member of this family you're going to use, which is to say the
    encoding. UTF-8 would seem a good bet.

    But...

    > >>> s = u'ö'
    > >>> s

    > u'\xf6'
    > >>> s.encode('base64')

    > Traceback (most recent call last):
    > File "<interactive input>", line 1, in ?
    > File "C:\Python23\lib\encodings\base64_codec.py", line 24, in
    > base64_encode
    > output = base64.encodestring(input)
    > File "C:\Python23\lib\base64.py", line 39, in encodestring
    > pieces.append(binascii.b2a_base64(chunk))
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
    > position 0: ordinal not in range(128)


    >>> u'ö'.encode('utf-8').encode('base64')
    'w7Y=\n'

    > When I don't specify it's unicode it works:
    > >>> s = 'ö'
    > >>> s

    > '\xf6'
    > >>> s.encode('base64')

    > '9g==\n'


    Well, this works because your terminal seems to be latin-1:

    >>> u'ö'.encode('latin-1').encode('base64')
    '9g==\n'

    What would you like to do with a character that isn't in latin-1?

    > The reason I want to base64-encode these unicode strings is because I
    > get those as input and want to store them in a MySQL database using
    > SQLObject.


    ! Why can't you just encode them as utf-8 strings? (Or, thinking
    about it, why doesn't SQLObject support unicode?)

    Cheers,
    mwh

    --
    I think if we have the choice, I'd rather we didn't explicitly put
    flaws in the reST syntax for the sole purpose of not insulting the
    almighty. -- /will on the doc-sig
     
    Michael Hudson, Nov 6, 2003
    #4
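
The "choose an encoding first, then base64" step can be sketched as follows. This assumes Python 3 and the standard base64 module instead of the Python 2 'base64' codec shown in the thread; unlike that codec, base64.b64encode does not append a trailing newline.

```python
import base64

text = '\xf6'  # the single character o-with-diaeresis

# Step 1: pick an encoding to turn text into bytes.
# Step 2: base64-encode those bytes.
utf8_b64 = base64.b64encode(text.encode('utf-8'))
latin1_b64 = base64.b64encode(text.encode('latin-1'))

# The two results differ because the underlying bytes differ.
print(utf8_b64, latin1_b64)

# Decoding must use the same encoding that was chosen in step 1.
restored = base64.b64decode(utf8_b64).decode('utf-8')
```

The two outputs correspond to the 'w7Y=' and '9g==' values in the post above, which is why the encoding has to be fixed before the data is stored.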
  5. "Sebastian Meyer" <> wrote in message
    news:p...
    > Hi newsgroup,
    >
    > i am trying to replace german special characters in strings like
    > str = re.sub('ö', 'oe', str)
    >
    > When i work with this, i always get the message
    > UniCode Error: ASCII decoding error : ordinal not in range(128)
    >


    Try adding

    sys.setdefaultencoding( 'latin-1' )

    to your site.py module, or rewrite your fragment as

    src = u'ö'
    dst = u'oe'
    s = re.sub(src.encode('latin-1'), dst.encode('latin-1'), s)

    If you are running on Windows you might want to change 'latin-1' to 'mbcs',
    as that seems to be the most forgiving codec, but it is Windows only.

    Joe
     
    Joe Fromm, Nov 6, 2003
    #5
  6. Sebastian Meyer wrote:

    > Hi newsgroup,
    >
    > i am trying to replace german special characters in strings like
    > str = re.sub('ö', 'oe', str)
    >
    > When i work with this, i always get the message
    > UniCode Error: ASCII decoding error : ordinal not in range(128)
    >
    > Yes i have googled, i searched the faq, manual and python library and
    > searched all known soruces of information. I played with the python
    > builtin function encode to enforce the rigth encoding, but the error
    > stays the same. I ve read a lot about UniCode and internal conversion
    > about Strings done by python, but somehow i ve missed the clue.
    > Nope, python says Huuups... ordinal not in range(128), ;-(
    >
    > Anyone of you having any idea?? Seems like i am too stupid to read
    > documentation carefully., perhaps i misunderstand something...
    >
    > thanks for your help in advance
    >
    > Sebastian


    Works here, even with my older snake:

    Python 2.2.1 (#1, Sep 10 2002, 17:49:17)
    [GCC 3.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> re.sub("ö", "oe", "Döspaddel")
    'Doespaddel'
    >>> re.sub("ö", "oe", u"Döspaddel")
    u'Doespaddel'
    >>> re.sub("ö", u"oe", u"Döspaddel")
    u'Doespaddel'
    >>> re.sub(u"ö", u"oe", u"Döspaddel")
    u'Doespaddel'

    To provoke a UnicodeError, I have to convert a unicode string with umlauts
    to str without providing the encoding:

    >>> str(u"Döspaddel")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)

    I suspect that you have something similar hidden in your code (i.e.
    characters >= 128 that are converted implicitly). The remedy is to
    explicitly encode with the appropriate encoding:

    >>> u"Döspaddel".encode("latin-1")
    'D\xf6spaddel'
    >>>


    Try to build a minimal script that shows the reported behaviour and fix it
    or post it for more detailed advice. By the way, don't use str as a
    variable name, it's the type of "ordinary" strings.

    Peter
     
    Peter Otten, Nov 6, 2003
    #6
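
The difference between an implicit ASCII conversion and an explicit one can be sketched like this. It assumes Python 3, where the conversion has to be requested by calling encode, so the failure is provoked deliberately here rather than hidden inside str().

```python
text = 'D\xf6spaddel'

# Encoding to ASCII fails for any character >= 128.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming a suitable encoding explicitly succeeds.
latin1_bytes = text.encode('latin-1')

# An error handler can be given if lossy output is acceptable.
lossy = text.encode('ascii', errors='replace')
```

Whether 'latin-1', 'utf-8' or a lossy handler is appropriate depends on what the receiving side expects; the sketch only shows that the choice has to be made somewhere.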
  7. On Thu, 06 Nov 2003 13:39:25 +0000, Michael Hudson wrote:

    > "Sebastian Meyer" <> writes:
    >
    >> Hi newsgroup,
    >>
    >> i am trying to replace german special characters in strings like
    >> str = re.sub('ö', 'oe', str)

    >
    > 1) str is the name of a builtin -- often a bad idea to use that as a
    > variable name.


    It was only an example name for the variable; rest assured that I don't
    use any builtins as variable names.
    Maybe not a good example... thanks for the hint.

    >
    > 2) I presume `str' is a unicode string? Try writing the literal as
    > u'ö' instead (and adding the appropriate coding cookie to your
    > source file if using Python 2.3). Or I guess you could write it
    >
    > u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'


    I'll try and report back...

    >
    > Cheers,
    > mwh
     
    Sebastian Meyer, Nov 6, 2003
    #7
  8. Joe Fromm wrote:

    >
    > Try adding
    >
    > sys.setdefaultencoding( 'latin-1' )
    >
    > to your site.py module, or rewrite your fragment as
    >

    At the end of site.py you can enable a piece of code that sets your
    default encoding to the current locale of your computer:

    if 1:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]

    This works great for me.

    Thanks for pointing me to site.py

    P.S. I really need some weeks off so I can read all the available
    documentation ;-)
     
    Rudy Schockaert, Nov 6, 2003
    #8
  9. > >>> u'ö'.encode('utf-8').encode('base64')
    > 'w7Y=\n'


    This works indeed. And thanks to Joe Fromm's hint (site.py) I don't have
    to worry about it anymore.
    >
    > What would you like to do with a character that isn't in latin-1?
    >

    Actually, I don't care as long as the encode and decode on the same
    machine give me back the original value.
    >
    >>The reason I want to base64-encode these unicode strings is because I
    >>get those as input and want to store them in a MySQL database using
    >>SQLObject.

    >
    >
    > ! Why can't you just encode them as utf-8 strings? (Or, thinking
    > about it, why doesn't SQLObject support unicode?)
    >


    The actual input strings don't really contain Unicode text values, but
    rather binary values I get as the result of calling win32.NetUserEnum.

    The manual of SQLObject (great product btw) explains how you can easily
    store binary data in a SQL table by encoding it when setting and
    decoding it when getting the value. That is just what I was trying to do.
     
    Rudy Schockaert, Nov 6, 2003
    #9
  10. Rudy Schockaert <> writes:

    > >
    > >>>>u'ö'.encode('utf-8').encode('base64')

    > > 'w7Y=\n'

    >
    > This works indeed. And thanks to Joe Fromm's hint (site.py) I don't
    > have to worry about it anymore.


    Well, I'm from the setdefaultencoding-is-evil camp, but it sounds like
    you're in a pretty icky situation.

    > > What would you like to do with a character that isn't in latin-1?
    > >

    > Actually, I don't care as long as the encode and decode on the same
    > machine give me back the original value.


    Huh?

    > >>The reason I want to base64-encode these unicode strings is because I
    > >>get those as input and want to store them in a MySQL database using
    > >>SQLObject.

    > > ! Why can't you just encode them as utf-8 strings? (Or, thinking
    > > about it, why doesn't SQLObject support unicode?)
    > >

    >
    > The actual input strings don't really contain unicode text values, but
    > rather binary values i get as result from calling win32.NetUserEnum.


    Oh, so they're not really unicode strings at all? Blech. That's
    really really nasty. Binary data should really be represented as
    (narrow) strings in Python. Perhaps the utf-16-le codec would be the
    most appropriate...

    Cheers,
    mwh

    --
    Q: What are 1000 lawyers at the bottom of the ocean?
    A: A good start.
    (A lawyer told me this joke.)
    -- Michael Ströder, comp.lang.python
     
    Michael Hudson, Nov 6, 2003
    #10
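
The utf-16-le suggestion can be sketched as a round trip. This is only an illustration under stated assumptions: Python 3, a value that arrives as a Unicode string even though it carries 16-bit binary units, and no unpaired surrogates in it (a strict utf-16-le encode would reject those). The helper names pack and unpack are hypothetical and are not part of SQLObject or the win32 extensions.

```python
import base64

def pack(value):
    """Turn a Unicode string into ASCII-safe text for storage."""
    raw = value.encode('utf-16-le')           # text -> bytes, 2 per unit
    return base64.b64encode(raw).decode('ascii')

def unpack(stored):
    """Reverse pack(): stored ASCII text back to the original string."""
    raw = base64.b64decode(stored)
    return raw.decode('utf-16-le')

original = '\x01\xf6\u20ac'                   # arbitrary sample value
stored = pack(original)
restored = unpack(stored)
```

Because the encoding is named in the code, the round trip does not depend on the machine's locale or default encoding.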
  11. Michael Hudson wrote:

    >
    > Well, I'm from the setdefaultencoding-is-evil camp, but it sounds like
    > you're in a pretty icky situation.
    >

    I wasn't even aware there are two camps. What would be the reasons not
    to use setdefaultencoding? As I configured it now it uses the systems
    locale to set the encoding. I'm using the same machine to retrieve data,
    manipulate it and store in a database (on the same machine).
    I would like to understand what could be wrong in this case.

    >>
    >>Actually, I don't care as long as the encode and decode on the same
    >>machine give me back the original value.

    >
    >
    > Huh?
    >

    What I mean is that I encode the data when I store it in the DB and
    decode it when I retrieve the data from the DB. I do this because
    SQLObject doesn't support the binary data. As long as the result that
    comes back out is exactly the same as it was when it went in, I don't care.

    >
    >>>>The reason I want to base64-encode these unicode strings is because I
    >>>>get those as input and want to store them in a MySQL database using
    >>>>SQLObject.
    >>>
    >>>! Why can't you just encode them as utf-8 strings? (Or, thinking
    >>>about it, why doesn't SQLObject support unicode?)
    >>>

    >>
    >>The actual input strings don't really contain unicode text values, but
    >>rather binary values i get as result from calling win32.NetUserEnum.

    >
    >
    > Oh, so they're not really unicode strings at all? Blech. That's
    > really really nasty. Binary data should really be represented as
    > (narrow) strings in Python.

    I'm just doing it the easy way, I guess. I get the data from the win32
    call as Unicode data, even when it contains binary data. Perhaps I
    will transform this data into a more useful format in a later phase, but
    that'll depend on the need.

    > Perhaps the utf-16-le codec would be the
    > most appropriate...
    >

    This is really not my thing. I noticed that on my system the encoding is
    now set to cp1252. What would be the difference if I switched to utf-16-le?

    Thanks for your explanation.

    Rudy
     
    Rudy Schockaert, Nov 6, 2003
    #11
  12. On Thu, 06 Nov 2003 15:10:49 +0100, Sebastian Meyer wrote:

    > On Thu, 06 Nov 2003 13:39:25 +0000, Michael Hudson wrote:
    >
    >> 2) I presume `str' is a unicode string? Try writing the literal as
    >> u'ö' instead (and adding the appropriate coding cookie to your
    >> source file if using Python 2.3). Or I guess you could write it
    >>
    >> u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'

    >
    > i ll try and report back...


    Okay, I've solved my problem... it seems that the method which tries
    to insert the data I process into the database raises the error. The
    data comes from XML files, and my derived xml.sax.handler.ContentHandler
    returns Unicode strings. The database routine tries to
    encode the values as ASCII and --**BOOOM**-- ... Exception.

    I now replace the special characters by their Unicode names,
    e.g. u'\N{LATIN SMALL LETTER O WITH DIAERESIS}' (thanks for the hint,
    Michael), and now all works fine... ;-))

    thanks for the great help NG

    Sebastian
     
    Sebastian Meyer, Nov 6, 2003
    #12
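
The situation described above, a SAX handler that yields Unicode text followed by a database layer that wants bytes, can be sketched as follows. It assumes Python 3; the class name TextCollector and the choice of UTF-8 for the outgoing value are illustrative assumptions, and no real database call is made.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Collect character data; SAX may deliver it in several chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

# A small UTF-8 encoded document containing a non-ASCII character.
document = '<name>D\xf6spaddel</name>'.encode('utf-8')

handler = TextCollector()
xml.sax.parseString(document, handler)
text = ''.join(handler.chunks)        # Unicode text from the parser

# Encode explicitly before handing the value to a byte-oriented layer.
db_value = text.encode('utf-8')
```

Encoding at this one boundary keeps the original characters, so they need not be replaced before storage.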
  13. Rudy Schockaert wrote:

    > At the end of site.py you can enable a piece of code that sets your
    > default encoding to the current locale of your computer:
    >
    > if 1:
    > # Enable to support locale aware default string encodings.
    > import locale
    > loc = locale.getdefaultlocale()
    > if loc[1]:
    > encoding = loc[1]
    >
    > This works great for me.


    instead of hacking your Python installation, I suggest using
    explicit calls to the "encode" method wherever you need to
    convert from Unicode to binary data on the way out.

    > P.S. I really need some weeks off so I can read all the available
    > documentation ;-)


    it shouldn't take you more than 15-20 minutes to learn enough
    about Unicode to be able to write Python code that processes
    non-ASCII text in a reliable and portable way:

    short version:
    http://effbot.org/zone/unicode-objects.htm

    long version:
    http://www.joelonsoftware.com/articles/Unicode.html

    </F>
     
    Fredrik Lundh, Nov 6, 2003
    #13
  14. >>P.S. I really need some weeks off so I can read all the available
    >>documentation ;-)

    >
    >
    > it shouldn't take you more than 15-20 minutes to learn enough
    > about Unicode to be able to write Python code that processes
    > non-ASCII text in a reliable and portable way:
    >
    > short version:
    > http://effbot.org/zone/unicode-objects.htm
    >
    > long version:
    > http://www.joelonsoftware.com/articles/Unicode.html
    >
    > </F>


    I wasn't referring to Unicode ;-) but to the existence of site.py .
    There still is so much I have to learn about python that I will need
    those weeks badly. I only got halfway in Alex' Python in a Nutshell
    (splendid book btw) which I already have since Europython :-(
     
    Rudy Schockaert, Nov 6, 2003
    #14
  15. Rudy Schockaert <> writes:

    > I wasn't even aware there are two camps. What would be the reasons not
    > to use setdefaultencoding?


    You lose portability (more correctly: you get a false sense of
    portability). If you write an application that requires the
    default encoding to be FOO-1, the application may work fine on system
    A, and fail on system B. Telling the operator of system B to change
    her default encoding may cause breakage of a different application on
    system B, as B has BAR-2 as the default encoding; changing it to FOO-1
    would break applications that require it to be BAR-2.

    IOW, if you require conversions between Unicode and byte strings,
    explicitly do them in your code. Explicit is better than implicit.

    > As I configured it now it uses the systems locale to set the
    > encoding. I'm using the same machine to retrieve data, manipulate it
    > and store in a database (on the same machine). I would like to
    > understand what could be wrong in this case.


    If the next user logs in on the same system, and has a different
    locale set, that user will misinterpret the data you have created.

    > What I mean is that I encode the data when I store it in the DB and
    > decode it when I retrieve the data from the DB. I do this because
    > SQLObject doesn't support the binary data. As long as the result that
    > comes back out is exactly the same as it was when it went in, I don't
    > care.


    Then you should *define* an encoding that your application uses,
    e.g. UTF-8, and use that encoding throughout wherever required,
    instead of having to ask the administrator to change a system setting.

    Regards,
    Martin
     
    Martin v. Löwis, Nov 6, 2003
    #15
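
The advice to define one application encoding and convert explicitly can be sketched as below. It assumes Python 3; APP_ENCODING, to_storage and from_storage are hypothetical names chosen for the illustration. The last line shows what a reader relying on a different default (here cp1252) would see.

```python
APP_ENCODING = 'utf-8'   # hypothetical application-wide choice

def to_storage(text):
    """Unicode text -> bytes, always with the application's encoding."""
    return text.encode(APP_ENCODING)

def from_storage(data):
    """Bytes -> Unicode text, always with the application's encoding."""
    return data.decode(APP_ENCODING)

stored = to_storage('D\xf6spaddel')
restored = from_storage(stored)

# Decoding with another encoding does not fail here; it silently
# produces different text, which is the misinterpretation to avoid.
misread = stored.decode('cp1252')
```

Since both directions go through the same constant, the result does not change when the locale or the logged-in user does.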
