Unicode problems, yet again

Ivan Voras

I have a string fetched from a database, in iso8859-2, with 8bit
characters, and I'm trying to send it over the network via a socket:

File "E:\Python24\lib\socket.py", line 249, in write
data = str(data) # XXX Should really reject non-string non-buffers
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
position 123: ordinal not in range(128)

The other end knows it should expect this encoding, so how do I send it?

(Does anyone else feel that Python's Unicode handling is, well...
suboptimal at least?)
 
Kent Johnson

Ivan said:
I have a string fetched from database, in iso8859-2, with 8bit
characters, and I'm trying to send it over the network, via a socket:

File "E:\Python24\lib\socket.py", line 249, in write
data = str(data) # XXX Should really reject non-string non-buffers
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
position 123: ordinal not in range(128)

The other end knows it should expect this encoding, so how to send it?

I think maybe the string from the database is a Unicode string, not 8-bit. What happens if you write
data.encode('iso8859-2')?

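
A minimal sketch of that suggestion, written for Python 3 (an assumption; the thread uses Python 2.4, whose `unicode` type corresponds to Python 3's `str` and whose 8-bit `str` corresponds to `bytes`). The helper name `to_wire_bytes` is hypothetical. It checks which kind of string the database actually returned and encodes only when the value is text:

```python
def to_wire_bytes(data, encoding="iso8859-2"):
    """Return bytes suitable for a socket.

    Hypothetical helper: text is encoded with the agreed encoding,
    while a value that is already bytes is passed through unchanged.
    """
    if isinstance(data, bytes):
        return data
    return data.encode(encoding)


# U+0161 (small s with caron) is the character named in the traceback.
print(to_wire_bytes("\u0161"))   # b'\xb9'
print(to_wire_bytes(b"\xb9"))    # b'\xb9'
```

If the first call is the one that changes the value, the database driver is handing back text rather than encoded bytes, which is what the traceback suggests.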
Ivan said:
(Does anyone else feel that python's unicode handling is, well...
suboptimal at least?)

It can be confusing and surprising, yes. Suboptimal...well, I wouldn't want to say that I could do
better...

Kent
 
John Machin

Ivan said:
I have a string fetched from database, in iso8859-2, with 8bit
characters,

"8bit characters"?? Maybe you did once, or you thought you did, but
what you have now is a Unicode string, and socket.write() is expecting
an ordinary string.

Ivan said:
and I'm trying to send it over the network, via a socket:

File "E:\Python24\lib\socket.py", line 249, in write
data = str(data) # XXX Should really reject non-string non-buffers
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
position 123: ordinal not in range(128)

Like it says, you have passed it a *UNICODE* string that has u'\u0161'
(the small s with caron) at position 123.

Ivan said:
The other end knows it should expect this encoding, so how to send it?

If the other end wants an encoding, then you should *encode* it, like
this:

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in
position 0: ordinal not in range(128)
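
The same failure and the explicit fix can be sketched in Python 3 (an assumption; in Python 2.4 the ASCII step happens implicitly inside `str(data)`, while here it has to be requested by name). The variable names are illustrative only:

```python
text = "\u0161"  # small s with caron, the character from the traceback

# Encoding to ASCII fails, because U+0161 has no ASCII representation.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)

# Encoding with the charset the other end expects succeeds.
encoded = text.encode("iso8859-2")
print(encoded)  # b'\xb9'

# Decoding with the same charset recovers the original text.
round_trip = encoded.decode("iso8859-2")
print(round_trip == text)  # True
```

The point is that the error comes from the choice of codec, not from the string itself: the same character encodes cleanly once the intended encoding is named.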
Ivan said:
(Does anyone else feel that python's unicode handling is, well...
suboptimal at least?)

Your posting gives no evidence for such a conclusion.
 
Ivan Voras

John said:
Your posting gives no evidence for such a conclusion.

Sorry, that was just steam venting from my ears - I often get bitten by
the "ordinal not in range(128)" error. :)
 
Martin v. Löwis

Ivan said:
Sorry, that was just steam venting from my ears - I often get bitten by
the "ordinal not in range(128)" error. :)

I think I'm glad to hear that. Errors should never pass silently, unless
explicitly silenced. When you get that error, it means there is a bug in
your code (just like a ValueError, a TypeError, or an IndexError). The
best way to deal with them is to fix them.

Now, the troubling part is clearly that you are getting *bitten* by
this specific error, and often so. I presume you get other kinds of
errors also often, but they don't bite :) This suggests that you should
really try to understand what the error message is trying to tell you,
and what precisely the underlying error is.

For other errors, you have already come to an understanding of what they
mean: NameError, ah, there must be a typo. AttributeError on None, ah,
forgot to check for a None result somewhere. ordinal not in range(128),
hmm, let's try different variations of the code and see which ones
work. This is going to continue biting you until you really understand
what it means.

The most "sane" mental model (and architecture) is one where you always
have Unicode strings in your code, and decode/encode only at system
interfaces (sockets, databases, ...). It turns out that the database
you use already follows this strategy (i.e. it decodes for you), so
you now only need to design the other interfaces so it is clear when
you have Unicode characters and when you have bytes.
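
One way to picture that architecture is the following sketch, written for Python 3 (an assumption; the thread itself is about Python 2.4). The names `WIRE_ENCODING`, `send_text`, `recv_text`, and `round_trip` are hypothetical, and `socket.socketpair()` merely stands in for a real network connection. Text stays text everywhere except inside the two functions that touch the socket:

```python
import socket

# Assumption: both ends have agreed on this encoding, as in the thread.
WIRE_ENCODING = "iso8859-2"


def send_text(sock, text):
    """Encode at the boundary: text goes in, bytes go on the wire."""
    sock.sendall(text.encode(WIRE_ENCODING))


def recv_text(sock, bufsize=1024):
    """Decode at the boundary: bytes come off the wire, text comes out.

    A single recv() is enough for this short demo; a real protocol
    would need framing to know where a message ends.
    """
    return sock.recv(bufsize).decode(WIRE_ENCODING)


def round_trip(text):
    """Send text through a connected socket pair and read it back."""
    sender, receiver = socket.socketpair()
    try:
        send_text(sender, text)
        return recv_text(receiver)
    finally:
        sender.close()
        receiver.close()


print(round_trip("du\u0161a") == "du\u0161a")  # True
```

Because ISO 8859-2 is a single-byte encoding, a partial read can still be decoded safely; with a multi-byte encoding such as UTF-8 the receiving side would also have to handle characters split across reads.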

Regards,
Martin
 
