Unicode/ascii encoding nightmare


Thomas W

I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the Norwegian word "fødselsdag".

'f\xc3\x83\xc2\xb8dselsdag'

I stored the string as "fødselsdag" but somewhere in my code it got
translated into the mess above and I cannot get the original string
back. It cannot be printed in the console or written to a plain text-file.
I've tried to convert it using
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier
versions of python and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

Any hints?
 

Mark Peters

The string below is the encoding of the norwegian word "fødselsdag".
I'm not sure which encoding method you used to get the string above.
Here's the result of my playing with the string in IDLE:
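(What follows is a sketch of such a session, not a verbatim transcript, and
assumes the string in question is 'f\xc3\x83\xc2\xb8dselsdag'.)

>>> s = 'f\xc3\x83\xc2\xb8dselsdag'
>>> s.decode('utf-8')          # the bytes decode cleanly as UTF-8 ...
u'f\xc3\xb8dselsdag'
>>> print s.decode('utf-8')    # ... but the text that comes out is wrong
fÃ¸dselsdag
>>> s.decode('latin-1')        # latin-1 never fails, but gives four odd characters
u'f\xc3\x83\xc2\xb8dselsdag'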
 

Robert Kern

Thomas said:
I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the norwegian word "fødselsdag".


I stored the string as "fødselsdag" but somewhere in my code it got
translated into the mess above and I cannot get the original string
back. It cannot be printed in the console or written to a plain text-file.
I've tried to convert it using

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier
versions of python and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

You would want .decode() (which converts a byte string into a Unicode string),
not .encode() (which converts a Unicode string into a byte string). You get
UnicodeDecodeErrors even though you are trying to .encode() because whenever
Python is expecting a Unicode string but gets a byte string, it tries to decode
the byte string as 7-bit ASCII. If that fails, then it raises a UnicodeDecodeError.
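
A quick sketch of both directions (assuming, for illustration, that the
original data is a latin-1 byte string):

>>> s = 'f\xf8dselsdag'                   # a latin-1 byte string
>>> s.encode('utf-8')                     # implicitly does s.decode('ascii') first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 1: ordinal not in range(128)
>>> s.decode('latin-1')                   # byte string -> Unicode string
u'f\xf8dselsdag'
>>> s.decode('latin-1').encode('utf-8')   # Unicode string -> UTF-8 byte string
'f\xc3\xb8dselsdag'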

However, I don't know of an encoding that takes u"fødselsdag" to
'f\xc3\x83\xc2\xb8dselsdag'.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

John Machin

Thomas said:
I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the norwegian word "fødselsdag".

There is no such thing as "*the* encoding" of any given string.
I stored the string as "fødselsdag" but somewhere in my code it got
translated into the mess above and I cannot get the original string
back.

Somewhere in your code??? Can't you track through your code to see
where it is being changed? Failing that, can't you show us your code so
that we can help you?

I have guessed *what* you got, but *how* you got it boggles the mind:

The effect is the same as (decode from latin1 to Unicode, encode as
utf8) *TWICE*. That's how you change one byte in the original to *FOUR*
bytes in the "mess":

>>> orig = 'f\xf8dselsdag'
>>> orig.decode('latin1').encode('utf8')
'f\xc3\xb8dselsdag'
>>> orig.decode('latin1').encode('utf8').decode('latin1').encode('utf8')
'f\xc3\x83\xc2\xb8dselsdag'
>>>
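
For the record, applying the reverse pair of conversions twice (a sketch,
assuming the above really is what happened) recovers the original bytes:

>>> mess = 'f\xc3\x83\xc2\xb8dselsdag'
>>> mess.decode('utf8').encode('latin1').decode('utf8').encode('latin1')
'f\xf8dselsdag'
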
It cannot be printed in the console or written to a plain text-file.

Incorrect. *Any* string can be printed on the console or written to a
file. What you mean is that when you look at the output, it is not what
you want.
I've tried to convert it using

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

encode is an attribute of unicode objects. If applied to a str object,
the str object is converted to unicode first using the default codec
(typically ascii).

s.encode('iso-8859-1') is effectively
s.decode('ascii').encode('iso-8859-1'), and s.decode('ascii') fails for
the (obvious(?)) reason given.
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Same story as for 'iso-8859-1'
And nothing helps. I cannot remember having these problems in earlier
versions of python

I would be very surprised if you couldn't reproduce your problem on any
2.n version of Python.
and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

Any hints?

1. Read the Unicode howto: http://www.amk.ca/python/howto/unicode
2. Read the Python documentation on .decode() and .encode() carefully.
3. Show us your code so that we can help you avoid the double
conversion to utf8. Tell us what IDE you are using.
4. Tell us what you are trying to achieve. Note that if all you are
trying to do is read and write text in Norwegian (or any other language
that's representable in iso-8859-1 aka latin1), then you don't have to
do anything special at all in your code (see the sketch below) -- this is
the good old "legacy" way of doing things, universally in vogue before
Unicode was invented!
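
For instance (a minimal sketch; the file name is only for illustration),
plain byte strings go in and out without any codec calls:

# Python 2: latin-1 text handled as plain byte strings, no decode/encode needed
text = 'f\xf8dselsdag'                  # "fødselsdag" as latin-1 bytes
f = open('example.txt', 'wb')
f.write(text + '\n')
f.close()
print open('example.txt', 'rb').read()  # shows "fødselsdag" on a latin-1 console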

HTH,
John
 

John Machin

Robert said:
However, I don't know of an encoding that takes u"fødselsdag" to
'f\xc3\x83\xc2\xb8dselsdag'.

There isn't one.

C3 and C2 hint at UTF-8.
The fact that C3 and C2 are both present, plus the fact that one
non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.

Cheers,
John
 

Georg Brandl

Thomas said:
I'm getting really annoyed with python in regards to
unicode/ascii-encoding problems.

The string below is the encoding of the norwegian word "fødselsdag".

Which encoding is this?
I stored the string as "fødselsdag" but somewhere in my code it got

You stored it where?
translated into the mess above and I cannot get the original string
back. It cannot be printed in the console or written to a plain text-file.
I've tried to convert it using

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Note that "encode" on a string object is often an indication for an error.
The encoding direction (for "normal" encodings, not special things like
the "zlib" codec) is as follows:

encode: from Unicode
decode: to Unicode

(the encode method of strings first DEcodes the string with the default
encoding, which is normally ascii, then ENcodes it with the given encoding)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier
versions of python and it's really annoying, even if it's my own fault
somehow, handling of normal characters like this shouldn't cause this
much hassle. Searching google for "codec can't decode byte" and
UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
not alone.

Unicode causes many problems if not used properly. If you want to use Unicode
strings, use them everywhere in your Python application, decode input as early
as possible, and encode output only before writing it to a file or another
program.
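
A sketch of that pattern (the file names and encodings here are just
assumptions):

# decode at the input boundary ...
raw = open('input.txt', 'rb').read()
text = raw.decode('latin-1')        # a unicode object from here on

# ... work with Unicode everywhere in between ...
text = text.upper()

# ... and encode only at the output boundary
out = open('output.txt', 'wb')
out.write(text.encode('utf-8'))
out.close()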

Georg
 

Andrea Griffini

John said:
The fact that C3 and C2 are both present, plus the fact that one
non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.
Indeed...

'f\xc3\x83\xc2\xb8dselsdag'
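
Something along these lines (a sketch) produces exactly that from the
Unicode original:

>>> u'f\xf8dselsdag'.encode('utf-8').decode('latin-1').encode('utf-8')
'f\xc3\x83\xc2\xb8dselsdag'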

Andrea
 

Thomas W

Ok, I've cleaned up my code a bit and it seems as if I've
encoded/decoded myself into a corner ;-). My understanding of Unicode
has room for improvement, that's for sure. I got some pointers, and the
initial code clean-up seems to have removed some of the strange results
I got, which several of you also pointed out.

Anyway, thanks for all your replies. I think I can get this thing up
and running with a bit more code tinkering. And I'll read up on some
unicode-docs as well. :) Thanks again.

Thomas
 

John Machin

Thomas said:
Ok, I've cleaned up my code a bit and it seems as if I've
encoded/decoded myself into a corner ;-). My understanding of Unicode
has room for improvement, that's for sure. I got some pointers, and the
initial code clean-up seems to have removed some of the strange results
I got, which several of you also pointed out.

Anyway, thanks for all your replies. I think I can get this thing up
and running with a bit more code tinkering. And I'll read up on some
unicode-docs as well. :) Thanks again.

I strongly suggest that you read the docs *FIRST*, and don't "tinker"
at all.

HTH,
John
 

John Machin

Andrea said:
'f\xc3\x83\xc2\xb8dselsdag'

Indeed yourself. Have you ever considered reading posts in
chronological order, or reading all posts in a thread? It might help
you avoid writing posts with non-zero information content.

Cheers,
John
 

Robert Kern

John said:
Indeed yourself. Have you ever considered reading posts in
chronological order, or reading all posts in a thread?

That presumes that messages arrive in chronological order and transmissions are
instantaneous. Neither are true.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Gabriel Genellina

That presumes that messages arrive in chronological order and
transmissions are instantaneous. Neither are true.

Sometimes I even get the replies *before* the original post arrives.


--
Gabriel Genellina
Softlab SRL

 

John Machin

Gabriel said:
Sometimes I even get the replies *before* the original post arrives.

What is in question is the likelihood that message B can appear before
message A, when both emanate from the same source, and B was sent about
7 minutes after A.
 

Hendrik van Rooyen

8<---------------------------------------
I strongly suggest that you read the docs *FIRST*, and don't "tinker"
at all.

HTH,
John

This is *good* advice - it's unlikely to be followed though, as the OP is prolly
just like most of us - you unpack the stuff out of the box and start assembling
it, and only towards the end, when it won't fit together, do you read the manual
to see where you went wrong...

- Hendrik
 

Andrea Griffini

John said:
Indeed yourself.

What does the above mean?
Have you ever considered reading posts in
chronological order, or reading all posts in a thread?

I do not think people read posts in chronological order;
it simply doesn't make sense. I also don't think many
read threads completely, but only until the issue is
clear or boredom kicks in.

Your nice "double whammy" post was enough to clarify
what happened to the OP, I just wanted to make a bit
more explicit what you meant; my poor english also
made me understand that you were just "suspecting"
such an error, so I verified and posted the result.

That your "suspect" was a sarcastic remark could be
clear only when reading the timewise "former" reply
that however happened to be lower in the thread tree
in my newsreader; fact that pushed it into the "not
worth reading" area.
It might help
you avoid writing posts with non-zero information content.

Why should I *avoid* writing posts with *non-zero*
information content? Double whammy on negation, or
still my poor English kicking in? :)

Suppose you hadn't posted the double whammy message,
and suppose someone else had made it seven minutes later
than your other post. I suppose that in this case
the message would be zero-content noise (and not
the precious pearl of wisdom it is because it
comes from you).

Andrea
 

Paul Boddie

Thomas said:
Ok, I've cleaned up my code abit and it seems as if I've
encoded/decoded myself into a corner ;-).

Yes, you may encounter situations where you have some string, you
"decode" it (ie. convert it to Unicode) using one character encoding,
but then you later "encode" it (ie. convert it back to a plain string)
using a different character encoding. This isn't a problem on its own,
but if you then take that plain string and attempt to convert it to
Unicode again, using the same input encoding as before, you'll be
misinterpreting the contents of the string.

This "round tripping" of character data is typical of Web applications:
you emit a Web page in one encoding, the fields in the forms are
represented in that encoding, and upon form submission you receive this
data. If you then process the form data using a different encoding,
you're misinterpreting what you previously emitted, and when you emit
this data again, you compound the error.
My understanding of Unicode has room for improvement, that's for sure. I got some pointers,
and the initial code clean-up seems to have removed some of the strange results I got, which
several of you also pointed out.

Converting to Unicode for processing is a "best practice" that you seem
to have adopted, but it's vital that you use character encodings
consistently. One trick, that can be used to mitigate situations where
you have less control over the encoding of data given to you, is to
attempt to convert to Unicode using an encoding that is "conservative"
with regard to acceptable combinations of byte sequences, such as
UTF-8; if such a conversion fails, it's quite possible that another
encoding applies, such as ISO-8859-1, and you can try that. Since
ISO-8859-1 is a "liberal" encoding, in the sense that any byte value
or combination of byte values is acceptable, it should only be used as
a last resort.
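
A sketch of that fallback (the function name here is made up for
illustration):

>>> def guess_decode(raw):
...     try:
...         return raw.decode('utf-8')      # conservative: fails on non-UTF-8 bytes
...     except UnicodeDecodeError:
...         return raw.decode('iso-8859-1') # liberal: never fails, last resort
...
>>> guess_decode('f\xc3\xb8dselsdag')       # valid UTF-8
u'f\xf8dselsdag'
>>> guess_decode('f\xf8dselsdag')           # not UTF-8, falls back
u'f\xf8dselsdag'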

However, it's best to have a high level of control over character
encodings rather than using tricks to avoid considering representation
issues carefully.

Paul
 

Cliff Wells

8<---------------------------------------
This is *good* advice - it's unlikely to be followed though, as the OP is prolly
just like most of us - you unpack the stuff out of the box and start assembling
it, and only towards the end, when it won't fit together, do you read the manual
to see where you went wrong...

I fall right into this camp(fire). I'm always amazed and awed at people
who actually read the docs *thoroughly* before starting. I know some
people do but frankly, unless it's a step-by-step tutorial, I rarely
read the docs beyond getting a basic understanding of what something
does before I start "tinkering".

I've always been a firm believer in the Chinese proverb:

I hear and I forget
I see and I remember
I do and I understand

Of course, I usually just skip straight to the third step and try to
work backwards as needed. This usually works pretty well but when it
doesn't it fails horribly. Unfortunately (for me), working from step
one rarely works at all, so that's the boat I'm stuck in.

I've always been a bit miffed at the RTFM crowd (and somewhat jealous, I
admit). I *do* RTFM, but as often as not the fine manual confuses as
much as clarifies. I'm not convinced this is the result of poor
documentation so much as that I personally have a different mental
approach to problem-solving than the others who find documentation
universally enlightening. I also suspect that I'm not alone in my
approach and that the RTFM crowd is more than a little close-minded
about how others might think about and approach solving problems and
understanding concepts.

Also, much documentation (including the Python docs) tends to be
reference-manual style. This is great if you *already* understand the
problem and just need details, but does about as much for
*understanding* as a dictionary does for learning a language. When I'm
perusing the Python reference manual, I usually find that 10 lines of
example code are worth 1000 lines of function descriptions and
cross-references.

Just my $0.02.

Regards,
Cliff
 

Cliff Wells

What is in question is the likelihood that message B can appear before
message A, when both emanate from the same source, and B was sent about
7 minutes after A.

Usenet, email, usenet/email gateways, internet in general... all in
all, pretty likely. I've often seen replies to my posts long before my
own post shows up. In fact, I've seen posts not show up for several
hours.

Regards,
Cliff
 
