Unicode/ascii encoding nightmare


Thomas W

I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the Norwegian word "fødselsdag".

'f\xc3\x83\xc2\xb8dselsdag'

I stored the string as "fødselsdag" but somewhere in my code it got
translated into the mess above and I cannot get the original string
back. It cannot be printed in the console or written to a plain text-file.
I've tried to convert it using
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier
versions of python and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

Any hints?
 

Mark Peters

The string below is the encoding of the norwegian word "fødselsdag".
I'm not sure which encoding method you used to get the string above.
Here's the result of my playing with the string in IDLE:
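(What follows is a sketch of such a session, not a verbatim transcript, and
assumes the string in question is 'f\xc3\x83\xc2\xb8dselsdag'.)

>>> s = 'f\xc3\x83\xc2\xb8dselsdag'
>>> s.decode('utf-8')          # the bytes decode cleanly as UTF-8 ...
u'f\xc3\xb8dselsdag'
>>> print s.decode('utf-8')    # ... but the text that comes out is wrong
fÃ¸dselsdag
>>> s.decode('latin-1')        # latin-1 never fails, but gives four odd characters
u'f\xc3\x83\xc2\xb8dselsdag'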
 

Robert Kern

Thomas said:
I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the norwegian word "fødselsdag".


I stored the string as "fødselsdag" but somewhere in my code it got
translated into the mess above and I cannot get the original string
back. It cannot be printed in the console or written to a plain text-file.
I've tried to convert it using

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier
versions of python and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

You would want .decode() (which converts a byte string into a Unicode string),
not .encode() (which converts a Unicode string into a byte string). You get
UnicodeDecodeErrors even though you are trying to .encode() because whenever
Python is expecting a Unicode string but gets a byte string, it tries to decode
the byte string as 7-bit ASCII. If that fails, then it raises a UnicodeDecodeError.
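
A quick sketch of both directions (assuming, for illustration, that the
original data is a latin-1 byte string):

>>> s = 'f\xf8dselsdag'                   # a latin-1 byte string
>>> s.encode('utf-8')                     # implicitly does s.decode('ascii') first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 1: ordinal not in range(128)
>>> s.decode('latin-1')                   # byte string -> Unicode string
u'f\xf8dselsdag'
>>> s.decode('latin-1').encode('utf-8')   # Unicode string -> UTF-8 byte string
'f\xc3\xb8dselsdag'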

However, I don't know of an encoding that takes u"fødselsdag" to
'f\xc3\x83\xc2\xb8dselsdag'.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

John Machin

Thomas said:
I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the norwegian word "fødselsdag".

There is no such thing as "*the* encoding" of any given string.
I stored the string as "fødselsdag" but somewhere in my code it got
translated into the mess above and I cannot get the original string
back.

Somewhere in your code??? Can't you track through your code to see
where it is being changed? Failing that, can't you show us your code so
that we can help you?

I have guessed *what* you got, but *how* you got it boggles the mind:

The effect is the same as (decode from latin1 to Unicode, encode as
utf8) *TWICE*. That's how you change one byte in the original to *FOUR*
bytes in the "mess":

>>> orig = 'f\xf8dselsdag'
>>> orig.decode('latin1').encode('utf8')
'f\xc3\xb8dselsdag'
>>> orig.decode('latin1').encode('utf8').decode('latin1').encode('utf8')
'f\xc3\x83\xc2\xb8dselsdag'
>>>
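
For the record, applying the reverse pair of conversions twice (a sketch,
assuming the above really is what happened) recovers the original bytes:

>>> mess = 'f\xc3\x83\xc2\xb8dselsdag'
>>> mess.decode('utf8').encode('latin1').decode('utf8').encode('latin1')
'f\xf8dselsdag'
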
It cannot be printed in the console or written to a plain text-file.

Incorrect. *Any* string can be printed on the console or written to a
file. What you mean is that when you look at the output, it is not what
you want.
I've tried to convert it using

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

encode is an attribute of unicode objects. If applied to a str object,
the str object is converted to unicode first using the default codec
(typically ascii).

s.encode('iso-8859-1') is effectively
s.decode('ascii').encode('iso-8859-1'), and s.decode('ascii') fails for
the (obvious(?)) reason given.
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Same story as for 'iso-8859-1'
And nothing helps. I cannot remember having these problems in earlier
versions of python

I would be very surprised if you couldn't reproduce your problem on any
2.n version of Python.
and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

Any hints?

1. Read the Unicode howto: http://www.amk.ca/python/howto/unicode
2. Read the Python documentation on .decode() and .encode() carefully.
3. Show us your code so that we can help you avoid the double
conversion to utf8. Tell us what IDE you are using.
4. Tell us what you are trying to achieve. Note that if all you are
trying to do is read and write text in Norwegian (or any other language
that's representable in iso-8859-1 aka latin1), then you don't have to
do anything special at all in your code (see the sketch below) -- this is
the good old "legacy" way of doing things, universally in vogue before
Unicode was invented!
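
For instance (a minimal sketch; the file name is only for illustration),
plain byte strings go in and out without any codec calls:

# Python 2: latin-1 text handled as plain byte strings, no decode/encode needed
text = 'f\xf8dselsdag'                  # "fødselsdag" as latin-1 bytes
f = open('example.txt', 'wb')
f.write(text + '\n')
f.close()
print open('example.txt', 'rb').read()  # shows "fødselsdag" on a latin-1 console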

HTH,
John
 

John Machin

Robert said:
However, I don't know of an encoding that takes u"fødselsdag" to
'f\xc3\x83\xc2\xb8dselsdag'.

There isn't one.

C3 and C2 hint at UTF-8.
The fact that C3 and C2 are both present, plus the fact that one
non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.

Cheers,
John
 

Georg Brandl

Thomas said:
I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the norwegian word "fødselsdag".

Which encoding is this?
I stored the string as "fødselsdag" but somewhere in my code it got

You stored it where?
translated into the mess above and I cannot get the original string
back. It cannot be printed in the console or written to a plain text-file.
I've tried to convert it using

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Note that "encode" on a string object is often an indication for an error.
The encoding direction (for "normal" encodings, not special things like
the "zlib" codec) is as follows:

encode: from Unicode
decode: to Unicode

(the encode method of strings first DEcodes the string with the default
encoding, which is normally ascii, then ENcodes it with the given encoding)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier
versions of python and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

Unicode causes many problems if not used properly. If you want to use Unicode
strings, use them everywhere in your Python application, decode input as early
as possible, and encode output only before writing it to a file or another
program.
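
A sketch of that pattern (the file names and encodings here are just
assumptions):

# decode at the input boundary ...
raw = open('input.txt', 'rb').read()
text = raw.decode('latin-1')        # a unicode object from here on

# ... work with Unicode everywhere in between ...
text = text.upper()

# ... and encode only at the output boundary
out = open('output.txt', 'wb')
out.write(text.encode('utf-8'))
out.close()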

Georg
 

Andrea Griffini

John said:
The fact that C3 and C2 are both present, plus the fact that one
non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.
Indeed...

'f\xc3\x83\xc2\xb8dselsdag'
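
Something along these lines (a sketch) produces exactly that from the
Unicode original:

>>> u'f\xf8dselsdag'.encode('utf-8').decode('latin-1').encode('utf-8')
'f\xc3\x83\xc2\xb8dselsdag'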

Andrea
 

Thomas W

Ok, I've cleaned up my code a bit and it seems as if I've
encoded/decoded myself into a corner ;-). My understanding of Unicode
has room for improvement, that's for sure. I got some pointers, and the
initial code clean-up seems to have removed some of the strange results
I got, which several of you also pointed out.

Anyway, thanks for all your replies. I think I can get this thing up
and running with a bit more code tinkering. And I'll read up on some
unicode-docs as well. :) Thanks again.

Thomas
 

John Machin

Thomas said:
Ok, I've cleaned up my code a bit and it seems as if I've
encoded/decoded myself into a corner ;-). My understanding of Unicode
has room for improvement, that's for sure. I got some pointers, and the
initial code clean-up seems to have removed some of the strange results
I got, which several of you also pointed out.

Anyway, thanks for all your replies. I think I can get this thing up
and running with a bit more code tinkering. And I'll read up on some
unicode-docs as well. :) Thanks again.

I strongly suggest that you read the docs *FIRST*, and don't "tinker"
at all.

HTH,
John
 

John Machin

Andrea said:
'f\xc3\x83\xc2\xb8dselsdag'

Indeed yourself. Have you ever considered reading posts in
chronological order, or reading all posts in a thread? It might help
you avoid writing posts with non-zero information content.

Cheers,
John
 

Robert Kern

John said:
Indeed yourself. Have you ever considered reading posts in
chronological order, or reading all posts in a thread?

That presumes that messages arrive in chronological order and transmissions are
instantaneous. Neither are true.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Gabriel Genellina

That presumes that messages arrive in chronological order and
transmissions are instantaneous. Neither are true.

Sometimes I even get the replies *before* the original post arrives.


--
Gabriel Genellina
Softlab SRL

 

John Machin

Gabriel said:
Sometimes I even get the replies *before* the original post arrives.

What is in question is the likelihood that message B can appear before
message A, when both emanate from the same source, and B was sent about
7 minutes after A.
 

Hendrik van Rooyen

8<---------------------------------------
I strongly suggest that you read the docs *FIRST*, and don't "tinker"
at all.

HTH,
John

This is *good* advice - it's unlikely to be followed though, as the OP is prolly
just like most of us - you unpack the stuff out of the box and start assembling
it, and only towards the end, when it won't fit together, do you read the manual
to see where you went wrong...

- Hendrik
 

Andrea Griffini

John said:
Indeed yourself.

What does the above mean?
Have you ever considered reading posts in
chronological order, or reading all posts in a thread?

I do not think people read posts in chronological order;
it simply doesn't make sense. I also don't think many
read threads completely, but only until the issue is
clear or boredom kicks in.

Your nice "double whammy" post was enough to clarify
what happened to the OP, I just wanted to make a bit
more explicit what you meant; my poor english also
made me understand that you were just "suspecting"
such an error, so I verified and posted the result.

That your "suspect" was a sarcastic remark could be
clear only when reading the timewise "former" reply
that however happened to be lower in the thread tree
in my newsreader; fact that pushed it into the "not
worth reading" area.
It might help
you avoid writing posts with non-zero information content.

Why should I *avoid* writing posts with *non-zero*
information content? Double whammy on negation, or
still my poor English kicking in? :)

Suppose you hadn't posted the double whammy message,
and suppose someone else had made it seven minutes later
than your other post. I suppose that in this case
the message would be zero-content noise (and not
the precious pearl of wisdom it is because it
comes from you).

Andrea
 

Paul Boddie

Thomas said:
Ok, I've cleaned up my code abit and it seems as if I've
encoded/decoded myself into a corner ;-).

Yes, you may encounter situations where you have some string, you
"decode" it (ie. convert it to Unicode) using one character encoding,
but then you later "encode" it (ie. convert it back to a plain string)
using a different character encoding. This isn't a problem on its own,
but if you then take that plain string and attempt to convert it to
Unicode again, using the same input encoding as before, you'll be
misinterpreting the contents of the string.

This "round tripping" of character data is typical of Web applications:
you emit a Web page in one encoding, the fields in the forms are
represented in that encoding, and upon form submission you receive this
data. If you then process the form data using a different encoding,
you're misinterpreting what you previously emitted, and when you emit
this data again, you compound the error.
My understanding of Unicode has room for improvement, that's for sure. I got some pointers,
and the initial code clean-up seems to have removed some of the strange results I got, which
several of you also pointed out.

Converting to Unicode for processing is a "best practice" that you seem
to have adopted, but it's vital that you use character encodings
consistently. One trick, that can be used to mitigate situations where
you have less control over the encoding of data given to you, is to
attempt to convert to Unicode using an encoding that is "conservative"
with regard to acceptable combinations of byte sequences, such as
UTF-8; if such a conversion fails, it's quite possible that another
encoding applies, such as ISO-8859-1, and you can try that. Since
ISO-8859-1 is a "liberal" encoding, in the sense that any byte value
or combination of byte values is acceptable, it should only be used as
a last resort.
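
A sketch of that fallback (the function name here is made up for
illustration):

>>> def guess_decode(raw):
...     try:
...         return raw.decode('utf-8')      # conservative: fails on non-UTF-8 bytes
...     except UnicodeDecodeError:
...         return raw.decode('iso-8859-1') # liberal: never fails, last resort
...
>>> guess_decode('f\xc3\xb8dselsdag')       # valid UTF-8
u'f\xf8dselsdag'
>>> guess_decode('f\xf8dselsdag')           # not UTF-8, falls back
u'f\xf8dselsdag'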

However, it's best to have a high level of control over character
encodings rather than using tricks to avoid considering representation
issues carefully.

Paul
 

Cliff Wells

8<---------------------------------------
This is *good* advice - it's unlikely to be followed though, as the OP is prolly
just like most of us - you unpack the stuff out of the box and start assembling
it, and only towards the end, when it won't fit together, do you read the manual
to see where you went wrong...

I fall right into this camp(fire). I'm always amazed and awed at people
who actually read the docs *thoroughly* before starting. I know some
people do but frankly, unless it's a step-by-step tutorial, I rarely
read the docs beyond getting a basic understanding of what something
does before I start "tinkering".

I've always been a firm believer in the Chinese proverb:

I hear and I forget
I see and I remember
I do and I understand

Of course, I usually just skip straight to the third step and try to
work backwards as needed. This usually works pretty well but when it
doesn't it fails horribly. Unfortunately (for me), working from step
one rarely works at all, so that's the boat I'm stuck in.

I've always been a bit miffed at the RTFM crowd (and somewhat jealous, I
admit). I *do* RTFM, but as often as not the fine manual confuses as
much as clarifies. I'm not convinced this is the result of poor
documentation so much as that I personally have a different mental
approach to problem-solving than the others who find documentation
universally enlightening. I also suspect that I'm not alone in my
approach and that the RTFM crowd is more than a little close-minded
about how others might think about and approach solving problems and
understanding concepts.

Also, much documentation (including the Python docs) tends to be
reference-manual style. This is great if you *already* understand the
problem and just need details, but does about as much for
*understanding* as a dictionary does for learning a language. When I'm
perusing the Python reference manual, I usually find that 10 lines of
example code are worth 1000 lines of function descriptions and
cross-references.

Just my $0.02.

Regards,
Cliff
 

Cliff Wells

What is in question is the likelihood that message B can appear before
message A, when both emanate from the same source, and B was sent about
7 minutes after A.

Usenet, email, usenet/email gateways, internet in general... all in
all, pretty likely. I've often seen replies to my posts long before my
own post shows up. In fact, I've seen posts not show up for several
hours.

Regards,
Cliff
 
