Strange problems with encoding

Sebastian Meyer · Nov 6, 2003

Hi newsgroup,

i am trying to replace german special characters in strings like
str = re.sub('ö', 'oe', str)

When i work with this, i always get the message
UniCode Error: ASCII decoding error : ordinal not in range(128)

Yes i have googled, i searched the faq, manual and python library and
searched all known sources of information. I played with the python
builtin function encode to enforce the right encoding, but the error
stays the same. I've read a lot about Unicode and the internal conversion
of strings done by python, but somehow i've missed the clue.
Nope, python says Huuups... ordinal not in range(128), ;-(

Anyone of you having any idea?? Seems like i am too stupid to read
documentation carefully; perhaps i misunderstand something...

thanks for your help in advance

Sebastian

Rudy Schockaert · Nov 6, 2003

Sebastian said:
Hi newsgroup,

i am trying to replace german special characters in strings like
str = re.sub('ö', 'oe', str)

When i work with this, i always get the message
UniCode Error: ASCII decoding error : ordinal not in range(128)

Yes i have googled, i searched the faq, manual and python library and
searched all known sources of information. I played with the python
builtin function encode to enforce the right encoding, but the error
stays the same. I've read a lot about Unicode and the internal conversion
of strings done by python, but somehow i've missed the clue.
Nope, python says Huuups... ordinal not in range(128), ;-(

Anyone of you having any idea?? Seems like i am too stupid to read
documentation carefully; perhaps i misunderstand something...

thanks for your help in advance

Sebastian

I'm experiencing something similar at the moment. I try to
base64-encode Unicode strings and I get the exact same error message.
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Python23\lib\encodings\base64_codec.py", line 24, in base64_encode
    output = base64.encodestring(input)
  File "C:\Python23\lib\base64.py", line 39, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

When I don't specify it's unicode it works:
'9g==\n'

The reason I want to base64-encode these unicode strings is because I
get those as input and want to store them in a MySQL database using
SQLObject.

Michael Hudson · Nov 6, 2003

Sebastian Meyer said:
Hi newsgroup,

i am trying to replace german special characters in strings like
str = re.sub('ö', 'oe', str)

1) str is the name of a builtin -- often a bad idea to use that as a
variable name.

2) I presume `str' is a unicode string? Try writing the literal as
u'ö' instead (and adding the appropriate coding cookie to your
source file if using Python 2.3). Or I guess you could write it

u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'
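
A minimal sketch of the suggestion above, written for present-day Python 3 rather than the Python 2.3 discussed in this thread (in Python 3 every str is already Unicode, so neither the u prefix nor a coding cookie is needed). The names O_UMLAUT and replace_o_umlaut are made up for illustration.

```python
import re

# Naming the character avoids depending on how the source file is encoded.
O_UMLAUT = "\N{LATIN SMALL LETTER O WITH DIAERESIS}"

def replace_o_umlaut(text):
    """Return text with every o-umlaut replaced by 'oe'."""
    return re.sub(O_UMLAUT, "oe", text)

# The variable is deliberately not called str, which would shadow the builtin.
result = replace_o_umlaut("D\u00f6spaddel")
```

For a fixed one-character substitution like this, text.replace(O_UMLAUT, "oe") would do the same job without the re module.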

Cheers,
mwh

Michael Hudson · Nov 6, 2003

Rudy Schockaert said:
I'm experiencing something similar at the moment. I try to
base64-encode Unicode strings and I get the exact same error message.

"base64-encoding Unicode strings" is not a particularly well defined
operation. "base64-encoding" is a way of turning *binary data* into a
particularly "safe" sequence of ascii characters.

Unicode (in some sense) is a family of ways of representing strings of
characters as binary data.

So to base-64 encode a Unicode string, you need to choose *which*
member of this family you're going to use, which is to say the
encoding. UTF-8 would seem a good bet.
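
The two-step idea above (pick an encoding first, then base64 the resulting bytes) might look like this in present-day Python 3. This is only a sketch: the helper names text_to_base64 and base64_to_text are invented here, and base64.b64encode is used instead of the Python 2 'base64' codec seen in the traceback.

```python
import base64

def text_to_base64(text, encoding="utf-8"):
    """Encode text to bytes with an explicit codec, then base64 those bytes."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")

def base64_to_text(data, encoding="utf-8"):
    """Reverse text_to_base64; the same codec must be used in both directions."""
    return base64.b64decode(data).decode(encoding)

encoded = text_to_base64("\u00f6")      # o-umlaut via UTF-8
restored = base64_to_text(encoded)
```

With UTF-8 the o-umlaut becomes w7Y= and with latin-1 it becomes 9g==, which matches the two results quoted in this thread apart from the trailing newline that the old codec appended.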

But...

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Python23\lib\encodings\base64_codec.py", line 24, in base64_encode
    output = base64.encodestring(input)
  File "C:\Python23\lib\base64.py", line 39, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
'w7Y=\n'

When I don't specify it's unicode it works:
'9g==\n'

Well, this works because your terminal seems to be latin-1:
'9g==\n'

What would you like to do with a character that isn't in latin-1?

The reason I want to base64-encode these unicode strings is because I
get those as input and want to store them in a MySQL database using
SQLObject.

! Why can't you just encode them as utf-8 strings? (Or, thinking
about it, why doesn't SQLObject support unicode?)

Cheers,
mwh

Joe Fromm · Nov 6, 2003

Sebastian Meyer said:
Hi newsgroup,

i am trying to replace german special characters in strings like
str = re.sub('ö', 'oe', str)

When i work with this, i always get the message
UniCode Error: ASCII decoding error : ordinal not in range(128)

Try adding

sys.setdefaultencoding( 'latin-1' )

to your site.py module, or rewrite your fragment as

import re
src = u'ö'   # 'from' is a reserved word; the u prefix makes this a unicode literal
dst = u'oe'
s = re.sub(src.encode('latin-1'), dst.encode('latin-1'), s)

If you are running on Windows you might want to change 'latin-1' to 'mbcs',
as that seems to be the most forgiving codec, but it is Windows only.

Joe

Peter Otten · Nov 6, 2003

Sebastian said:
Hi newsgroup,

i am trying to replace german special characters in strings like
str = re.sub('ö', 'oe', str)

When i work with this, i always get the message
UniCode Error: ASCII decoding error : ordinal not in range(128)

Yes i have googled, i searched the faq, manual and python library and
searched all known sources of information. I played with the python
builtin function encode to enforce the right encoding, but the error
stays the same. I've read a lot about Unicode and the internal conversion
of strings done by python, but somehow i've missed the clue.
Nope, python says Huuups... ordinal not in range(128), ;-(

Anyone of you having any idea?? Seems like i am too stupid to read
documentation carefully; perhaps i misunderstand something...

thanks for your help in advance

Sebastian

Works here, even with my older snake:

Python 2.2.1 (#1, Sep 10 2002, 17:49:17)
[GCC 3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
u'Doespaddel'

To provoke a UnicodeError, I have to convert a unicode string with umlauts
to str without providing the encoding:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

I suspect that you have something similar hidden in your code (i. e.
characters >= 128 that are not converted). The remedy is to explicitly
decode with the appropriate encoding:
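
The interactive session that originally followed here did not survive in this archive. As a stand-in, here is a hedged Python 3 sketch of the same idea (not the poster's original code): bytes containing values >= 128 fail under ASCII but decode cleanly once the codec is named. The sample bytes and the latin-1 choice are assumptions.

```python
# Bytes as they might arrive from a latin-1 file or terminal (assumed input).
raw = b"D\xf6spaddel"

try:
    raw.decode("ascii")          # what an implicit ASCII conversion amounts to
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False             # byte 0xF6 is outside range(128)

text = raw.decode("latin-1")     # explicit codec: succeeds
```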

Try to build a minimal script that shows the reported behaviour and fix it
or post it for more detailed advice. By the way, don't use str as a
variable name, it's the type of "ordinary" strings.

Peter

Sebastian Meyer · Nov 6, 2003

1) str is the name of a builtin -- often a bad idea to use that as a
variable name.

it was only an example name for the variable, rest assured that i don't
use any builtins as variable names
maybe not a good example ... thanks for the hint

2) I presume `str' is a unicode string? Try writing the literal as
u'ö' instead (and adding the appropriate coding cookie to your
source file if using Python 2.3). Or I guess you could write it

u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'

i ll try and report back...

Rudy Schockaert · Nov 6, 2003

Joe said:
Try adding

sys.setdefaultencoding( 'latin-1' )

to your site.py module, or rewrite your fragment as

At the end of site.py you can enable a piece of code that sets your
default encoding to the current locale of your computer:

if 1:
    # Enable to support locale aware default string encodings.
    import locale
    loc = locale.getdefaultlocale()
    if loc[1]:
        encoding = loc[1]

This works great for me.

Thanks for pointing me to site.py

P.S. I really need some weeks off so I can read all the available
documentation ;-)

Rudy Schockaert · Nov 6, 2003

'w7Y=\n'

This works indeed. And thanks to Joe Fromm's hint (site.py) I don't have
to worry about it anymore.

What would you like to do with a character that isn't in latin-1?

Actually, I don't care as long as the encode and decode on the same
machine give me back the original value.

! Why can't you just encode them as utf-8 strings? (Or, thinking
about it, why doesn't SQLObject support unicode?)

The actual input strings don't really contain unicode text values, but
rather binary values I get as a result of calling win32.NetUserEnum.

The manual of SQLObject (great product btw) explains how you can easily
store binary data in a SQL table by encoding it when setting and
decoding it when getting the value. That is just what I was trying to do.

Michael Hudson · Nov 6, 2003

Rudy Schockaert said:
This works indeed. And thanks to Joe Fromm's hint (site.py) I don't
have to worry about it anymore.

Well, I'm from the setdefaultencoding-is-evil camp, but it sounds like
you're in a pretty icky situation.

Actually, I don't care as long as the encode and decode on the same
machine give me back the original value.

Huh?

The actual input strings don't really contain unicode text values, but
rather binary values i get as result from calling win32.NetUserEnum.

Oh, so they're not really unicode strings at all? Blech. That's
really really nasty. Binary data should really be represented as
(narrow) strings in Python. Perhaps the utf-16-le codec would be the
most appropriate...
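
To make the difference between a locale codec such as cp1252 and utf-16-le concrete, here is a small Python 3 sketch (the thread itself is about Python 2). The helper round_trips and the sample string are hypothetical; the point is only that cp1252 cannot represent every character, while utf-16-le can represent any ordinary code point.

```python
def round_trips(text, encoding):
    """Return True if text survives encode followed by decode with this codec."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

# Hypothetical data: A-with-macron, euro sign, o-umlaut.
sample = "\u0100\u20ac\u00f6"

utf16_ok = round_trips(sample, "utf-16-le")   # every character is representable
cp1252_ok = round_trips(sample, "cp1252")     # U+0100 has no cp1252 byte
```

One caveat, as far as Python 3 is concerned: lone surrogate code units are rejected by utf-16-le unless errors="surrogatepass" is given, so truly arbitrary binary data is still better carried as bytes.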

Cheers,
mwh

Rudy Schockaert · Nov 6, 2003

Michael said:
Well, I'm from the setdefaultencoding-is-evil camp, but it sounds like
you're in a pretty icky situation.

I wasn't even aware there are two camps. What would be the reasons not
to use setdefaultencoding? As I have configured it now, it uses the system's
locale to set the encoding. I'm using the same machine to retrieve data,
manipulate it and store it in a database (on the same machine).
I would like to understand what could be wrong in this case.

Huh?

What I mean is that I encode the data when I store it in the DB and
decode it when I retrieve the data from the DB. I do this because
SQLObject doesn't support the binary data. As long as the result that
comes back out is exactly the same as it was when it went in, I don't care.

Oh, so they're not really unicode strings at all? Blech. That's
really really nasty. Binary data should really be represented as
(narrow) strings in Python.

I'm just doing it the easy way, I guess. I get the data from the win32
call as Unicode data, even when it contains binary data. Perhaps I
will transform this data into a more useful format in a later phase, but
that'll depend on the need.

Perhaps the utf-16-le codec would be the
most appropriate...

This is really not my thing. I noticed that on my system the encoding is
now set to cp1252. What would be the difference if I switched to utf-16-le?

Thanks for your explanation.

Rudy

Sebastian Meyer · Nov 6, 2003

i ll try and report back...

okay, i've solved my problem... it seems that the method which tries
to insert the data i process into the database raises the error. The
data comes from XML files, and my derived xml.sax.handler.ContentHandler
returns Unicode data. The database routine tries to
encode the values as ASCII and --**BOOOM**-- ... Exception.

I now replace the special characters by their Unicode names,
e.g. u'\N{LATIN SMALL LETTER O WITH DIAERESIS}' (thanks for the hint
michael), and now it all works fine... ;-))
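
The failure described above can be reproduced in a few lines. This is a Python 3 sketch under assumptions, not the poster's code: the class TextCollector, the function extract_text and the sample document are invented for illustration, and the "database layer" is reduced to a bare encode call.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Collect character data; the parser delivers it as Unicode text."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

def extract_text(xml_bytes):
    """Parse an XML document given as bytes and return its character data."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)

# Hypothetical input document, UTF-8 encoded.
doc = "<name>D\u00f6spaddel</name>".encode("utf-8")
name = extract_text(doc)

try:
    name.encode("ascii")            # what an ASCII-only database layer would attempt
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

utf8_bytes = name.encode("utf-8")   # an explicit codec avoids the error
```

Encoding explicitly at the database boundary keeps the original characters, whereas replacing each umlaut by an ASCII spelling changes the stored text.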

thanks for the great help NG

Sebastian

Fredrik Lundh · Nov 6, 2003

Rudy said:
At the end of site.py you can enable a piece of code that sets your
default encoding to the current locale of your computer:

if 1:
    # Enable to support locale aware default string encodings.
    import locale
    loc = locale.getdefaultlocale()
    if loc[1]:
        encoding = loc[1]

This works great for me.

instead of hacking your Python installation, I suggest using
explicit calls to the "encode" method wherever you need to
convert from Unicode to binary data on the way out.

P.S. I really need some weeks off so I can read all the available
documentation ;-)

it shouldn't take you more than 15-20 minutes to learn enough
about Unicode to be able to write Python code that processes
non-ASCII text in a reliable and portable way:

short version:
http://effbot.org/zone/unicode-objects.htm

long version:
http://www.joelonsoftware.com/articles/Unicode.html

</F>

Rudy Schockaert · Nov 6, 2003

P.S. I really need some weeks off so I can read all the available
documentation ;-)

it shouldn't take you more than 15-20 minutes to learn enough
about Unicode to be able to write Python code that processes
non-ASCII text in a reliable and portable way:

short version:
http://effbot.org/zone/unicode-objects.htm

long version:
http://www.joelonsoftware.com/articles/Unicode.html

</F>

I wasn't referring to Unicode ;-) but to the existence of site.py.
There still is so much I have to learn about Python that I will need
those weeks badly. I only got halfway in Alex' Python in a Nutshell
(splendid book btw) which I already have since Europython :-(

Martin v. Löwis · Nov 6, 2003

Rudy Schockaert said:
I wasn't even aware there are two camps. What would be the reasons not
to use setdefaultencoding?

You lose portability (more correctly: you get a false sense of
portability). If you write an application that requires the
default encoding to be FOO-1, the application may work fine on system
A, and fail on system B. Telling the operator of system B to change
her default encoding may cause breakage of a different application on
system B, as B has BAR-2 as the default encoding; changing it to FOO-1
would break applications that require it to be BAR-2.

IOW, if you require conversions between Unicode and byte strings,
explicitly do them in your code. Explicit is better than implicit.

As I configured it now it uses the systems locale to set the
encoding. I'm using the same machine to retrieve data, manipulate it
and store in a database (on the same machine). I would like to
understand what could be wrong in this case.

If the next user logs in on the same system, and has a different
locale set, that user will misinterpret the data you have created.

What I mean is that I encode the data when I store it in the DB and
decode it when I retrieve the data from the DB. I do this because
SQLObject doesn't support the binary data. As long as the result that
comes back out is exactly the same as it was when it went in, I don't
care.

Then you should *define* an encoding that your application uses,
e.g. UTF-8, and use that encoding throughout wherever required,
instead of asking the administrator to change a system setting.
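
One way to read this advice, sketched in Python 3: the application names its encoding once and converts explicitly at its boundaries, so nothing depends on the locale or on site.py. APP_ENCODING, to_storage and from_storage are hypothetical names, and UTF-8 is simply the example choice mentioned above.

```python
APP_ENCODING = "utf-8"   # assumed application-wide choice, not a system setting

def to_storage(text):
    """Convert text to bytes on the way out (e.g. before a database write)."""
    return text.encode(APP_ENCODING)

def from_storage(data):
    """Convert stored bytes back to text on the way in."""
    return data.decode(APP_ENCODING)

original = "Gr\u00fc\u00dfe, \u20ac 5"
stored = to_storage(original)
loaded = from_storage(stored)
```

Because the codec is fixed in the code, the round trip gives the same result for every user and on every machine, whatever locale is configured.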

Regards,
Martin
