Need debugging knowhow for my creeping Unicodephobia

kj

Some people have mathphobia. I'm developing a wicked case of
Unicodephobia.

I have read a *ton* of stuff on Unicode. It doesn't even seem all
that hard. Or so I think. Then I start writing code, and WHAM:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

(There, see? My Unicodephobia just went up a notch.)

Here's the thing: I don't even know how to *begin* debugging errors
like this. This is where I could use some help.

In the past I've gone for the method of choice of the clueless:
"programming by trial-and-error": try random crap until something
"works." And if that "strategy" fails, I come begging for help to
c.l.p. And thanks for the very effective pointers for getting rid
of the errors.

But afterwards I remain as clueless as ever... It's the old "give
a man a fish" vs. "teach a man to fish" story.

I need a systematic approach to troubleshooting and debugging these
Unicode errors. I don't know what. Some tools maybe. Some useful
modules or builtin commands. A diagnostic flowchart? I don't
think that any more RTFM on Unicode is going to help (I've done it
in spades), but if there's a particularly good write-up on Unicode
debugging, please let me know.

Any suggestions would be much appreciated.

FWIW, I'm using Python 2.6. The example above happens to come from
a script that extracts data from HTML files, which are all in
English, but errors like it are a daily occurrence when I write code
to process non-English text. The script uses Beautiful Soup. I won't
post a lot of code because, as I said, what I'm after is not so
much a way around this specific error as much as the tools and
techniques to troubleshoot it and fix it on my own. But to ground
the problem a bit I'll say that the exception above happens during
the execution of a statement of the form:

x = '%s %s' % (y, z)

Also, I found that, with the exact same values y and z as above,
all of the following statements work perfectly fine:

x = '%s' % y
x = '%s' % z
print y
print z
print y, z

TIA!

~K
 
Jonathan Gardner

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

You'll have to understand some terminology first.

"codec" is a description of how to convert unicode data to and from a
stream of bytes.

"decode" means you are taking a series of bytes and converting it to
unicode.

"encode" is the opposite---take a unicode string and convert it to a
stream of bytes.

"ascii" is a codec that can only represent code points 0-127, using
bytes 0-127. "utf-8", "utf-16", etc. are other codecs. There are a lot
of them. Only some of them (e.g., utf-8, utf-16) can encode all of
unicode. Most (e.g., ascii) can only do a subset of unicode.

In this case, you've fed the decoder a stream of bytes containing a
byte outside the 0-127 range (0xc2, i.e. 194). Since the decoder thinks
it's working with ascii, it doesn't know what to do with that byte.
There are a number of ways to fix this:

(1) Feed it unicode instead, so it doesn't try to decode it.

(2) Tell it what encoding you are using, because it's obviously not
ascii.
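
A minimal sketch of the two fixes above, written so that it runs on Python 2.6+ as well as Python 3 (where Python 2's str corresponds to bytes and unicode to str). The byte values are made up for illustration; only the leading 0xc2 comes from the error message, and UTF-8 is an assumption about the real encoding.

```python
# Hypothetical input: UTF-8 bytes whose first byte is 0xc2, as in the error.
raw = b'\xc2\xa0caf\xc3\xa9'

# Decoding as ASCII fails on the first byte that is not in range(128).
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc   # exc.start is the offending position, exc.object the bytes

# Fix (2): name the real encoding when decoding (assumed to be UTF-8 here).
text = raw.decode('utf-8')

# Fix (1): from here on, pass the unicode object around, not the raw bytes.
```

The exception object itself is a useful debugging tool: its start, end, encoding and object attributes say which bytes failed and under which codec.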

kj said:
FWIW, I'm using Python 2.6.  The example above happens to come from
a script that extracts data from HTML files, which are all in
English, but they are a daily occurrence when I write code to
process non-English text.  The script uses Beautiful Soup.  I won't
post a lot of code because, as I said, what I'm after is not so
much a way around this specific error as much as the tools and
techniques to troubleshoot it and fix it on my own.  But to ground
the problem a bit I'll say that the exception above happens during
the execution of a statement of the form:

  x = '%s %s' % (y, z)

Also, I found that, with the exact same values y and z as above,
all of the following statements work perfectly fine:

  x = '%s' % y
  x = '%s' % z
  print y
  print z
  print y, z

What are y and z? Are they unicode or strings? What are their values?

It sounds like someone, probably beautiful soup, is trying to turn
your strings into unicode. A full stacktrace would be useful to see
who did what where.
 
MRAB

Decode all text input; encode all text output; do all text processing
in Unicode, which also means making all text literals Unicode (prefixed
with 'u').

Note: I'm talking about when you're working with _text_, as distinct
from when you're working with _binary data_, ie bytes.
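
As a rough sketch of that decode/process/encode rule: io.open (available from Python 2.6, and the same as open on Python 3) decodes on read and encodes on write when given an encoding. The file names, the process function and the choice of UTF-8 are all assumptions made for the example.

```python
import io
import os
import tempfile


def process(text):
    # All processing happens on unicode text; the literal is unicode too.
    return text.upper() + u'!'


# Set up a demo input file containing UTF-8 bytes (assumed encoding).
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'in.txt')
out_path = os.path.join(workdir, 'out.txt')
with io.open(in_path, 'wb') as f:
    f.write(u'caf\xe9'.encode('utf-8'))

# Decode on input ...
with io.open(in_path, 'r', encoding='utf-8') as f:
    text = f.read()

# ... work in unicode ...
result = process(text)

# ... and encode on output.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(result)

# Read the raw bytes back to see what actually landed on disk.
with io.open(out_path, 'rb') as f:
    out_bytes = f.read()
```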
 
Anthony Tolle

Some people have mathphobia.  I'm developing a wicked case of
Unicodephobia.
[snip]

Some general advice (Looks like I am reiterating what MRAB said -- I
type slower :):

1. If possible, use unicode strings for everything. That is, don't
use both str and unicode within the same project.

2. If that isn't possible, convert strings to unicode as early as
possible, work with them that way, then convert them back as late as
possible.

3. Know what type of string you are working with! If a function
returns or accepts a string value, verify whether the expected type is
unicode or str.

4. Consider switching to Python 3.x, since there is only one string
type (unicode).

 
kj

What are y and z?

x = "%s %s" % (table['id'], table.tr.renderContents())

where the variable table represents a BeautifulSoup.Tag instance.

Are they unicode or strings?

The first item (table['id']) is unicode, and the second is str.

What are their values?

The only easy way I know to examine the values of these strings is
to print them, which, I know, is very crude. (IOW, to answer this
question usefully, in the context of this problem, more Unicode
knowhow is needed than I have.) If I print them, the output for
the first one on my screen is "mainTable", and for the second it is

It sounds like someone, probably beautiful soup, is trying to turn
your strings into unicode. A full stacktrace would be useful to see
who did what where.

Unfortunately, there's not much in the stacktrace:

Traceback (most recent call last):
  File "./download_tt.py", line 427, in <module>
    x = "%s %s" % (table['id'], table.tr.renderContents())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 41: ordinal not in range(128)

(NB: the difference between this error message and the one I
originally posted, namely the position of the unrecognized byte,
is because I simplified the code for the purpose of posting it
here, eliminating one additional processing of the second entry of
the tuple above.)
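
For what it's worth, the pattern in this traceback can be sketched as follows. In Python 2, formatting a unicode value and a str value into one string makes the interpreter decode the str with the default ASCII codec, which is also why each '%s' on its own works: no mixing takes place. The sketch performs that decode explicitly so it also runs on Python 3 (bytes playing the role of Python 2's str). The byte string is a made-up stand-in for the renderContents() output, and UTF-8 is an assumption about its encoding.

```python
# Hypothetical values: a unicode id and a UTF-8 encoded byte string.
table_id = u'mainTable'
row_html = b'<td>\xc2\xa0</td>'

# This is the decode that Python 2 attempts behind the scenes when the
# two types meet in one format operation.
try:
    row_html.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start   # index of the first non-ASCII byte

# Decoding explicitly, with the right codec, before formatting avoids it.
x = u'%s %s' % (table_id, row_html.decode('utf-8'))
```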

~K
 
David Malcolm

Some people have mathphobia. I'm developing a wicked case of
Unicodephobia.
[snip]

Some general advice (Looks like I am reiterating what MRAB said -- I
type slower :):

1. If possible, use unicode strings for everything. That is, don't
use both str and unicode within the same project.

2. If that isn't possible, convert strings to unicode as early as
possible, work with them that way, then convert them back as late as
possible.

3. Know what type of string you are working with! If a function
returns or accepts a string value, verify whether the expected type is
unicode or str.

4. Consider switching to Python 3.x, since there is only one string
type (unicode).

Some further nasty gotchas:

5. Be wary of the encoding of sys.stdout (and stderr/stdin), e.g. when
issuing a "print" statement: they can change on Unix depending on
whether the python process is directly connected to a tty or not.

(a) If they're directly connected to a tty, their encoding is taken from
the locale, UTF-8 on my machine:
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ
(prints alpha, beta, gamma to terminal, though these characters might
not survive being sent in this email)

(b) If they're not (e.g. cronjob, daemon, within a shell pipeline, etc)
their encoding is the default encoding, which is typically ascii;
rerunning the same command, but piping into "cat":
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'| cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

(c) These problems can lurk in sources and only manifest themselves
during _deployment_ of code. You can set PYTHONIOENCODING=ascii in the
environment to force (a) to behave like (b), so that your code will fail
whilst you're _developing_ it, rather than on your servers at midnight:
[david@brick ~]$ PYTHONIOENCODING=ascii python -c 'print u"\u03b1\u03b2\u03b3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

(Given the above, it could be argued perhaps that one should never
"print" unicode instances, and instead should write the data to
file-like objects, specifying an encoding. Not sure).

6. If you're using pygtk (specifically the "pango" module, typically
implicitly imported), be warned that it abuses the C API to set the
default encoding inside python, which probably breaks any unicode
instances in memory at the time, and is likely to cause weird side
effects:
[david@brick ~]$ python
Python 2.6.2 (r262:71600, Jan 25 2010, 13:22:47)
[GCC 4.4.2 20100121 (Red Hat 4.4.2-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
'utf-8'
(the above is on Fedora 12, though I'd expect to see the same weirdness
on any linux distro running gnome 2)

Python 3 will probably make this all much easier; you'll still have to
care about encodings when dealing with files/sockets/etc, but it should
be much more clear what's going on. I hope.
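
One way to act on the parenthetical in item 5 is to wrap the byte stream in a writer with an explicit encoding rather than relying on print. A rough sketch, with io.BytesIO standing in for a real file or pipe and UTF-8 as an assumed target encoding:

```python
import codecs
import io

# Stand-in for a real byte stream such as a file or a pipe.
buffer = io.BytesIO()

# The writer encodes every unicode string it is given before writing it.
writer = codecs.getwriter('utf-8')(buffer)
writer.write(u'\u03b1\u03b2\u03b3\n')

encoded = buffer.getvalue()
```

On Python 2 the same wrapper can be put around sys.stdout; on Python 3 the underlying byte stream is sys.stdout.buffer.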

Hope this is helpful
Dave
 
kj

One of y or z is unicode, the other is str.

Yes, that was the root of the problem.

1. Print the repr of each value so you can see which is which.

Thanks for pointing out repr; it's really useful when dealing with
Unicode headaches.
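
A small sketch of the repr approach: a hypothetical describe helper (not from any library) that reports the type next to the repr, since the type is what decides whether an implicit decode will happen. The two values are invented stand-ins for the ones in this thread.

```python
def describe(value):
    """Return the type name and repr of a value, e.g. for a debug print."""
    return '%s %r' % (type(value).__name__, value)


# Hypothetical values resembling the two items discussed in this thread.
table_id = u'mainTable'
row_html = b'<td>\xc2\xa0</td>'

# On Python 2 the type names come out as unicode and str; on Python 3
# as str and bytes. Either way, non-ASCII bytes show up as \x escapes.
for value in (table_id, row_html):
    print(describe(value))
```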



Thanks for all the replies!

~K
 
mk

kj said:
I have read a *ton* of stuff on Unicode. It doesn't even seem all
that hard. Or so I think. Then I start writing code, and WHAM:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

(There, see? My Unicodephobia just went up a notch.)

Here's the thing: I don't even know how to *begin* debugging errors
like this. This is where I could use some help.
<type 'str'>


See what I mean? You encode INTO string, and decode OUT OF string.

To make matters more complicated, str.encode() internally DECODES from
string into unicode:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

There's logic to this, although it makes my brain want to explode. :)

Regards,
mk
 
kj

mk said:
To make matters more complicated, str.encode() internally DECODES from
string into unicode:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

There's logic to this, although it makes my brain want to explode. :)


Thanks for pointing this one out! It could have easily pushed my
Unicodephobia into the incurable zone...

~K
 
MRAB

mk said:
<type 'str'>


See what I mean? You encode INTO string, and decode OUT OF string.

Traditionally strings were strings of byte-sized characters. Because they
were byte-sized they could also be used to contain binary data.

Then along came Unicode.

When working with Unicode in Python 2, you should use the 'unicode' type
for text (Unicode strings) and limit the 'str' type to binary data
(bytestrings, ie bytes) only.

In Python 3 they've been renamed to 'str' for Unicode _strings_ and
'bytes' for binary data (bytes!).

mk said:
To make matters more complicated, str.encode() internally DECODES from
string into unicode:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

There's logic to this, although it makes my brain want to explode. :)

Strictly speaking, only Unicode can be encoded.

What Python 2 is doing here is trying to be helpful: if it's already a
bytestring then decode it first to Unicode and then re-encode it to a
bytestring.

Unfortunately, the default encoding is ASCII, and the bytestring isn't
valid ASCII. Python 2 is being 'helpful' in a bad way!
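
The two hidden steps can be spelled out in a short sketch. py2_style_encode is a hypothetical helper written only to mimic the behaviour described above; it is not part of Python, and the byte values are made up apart from the leading 0xc4 seen in the traceback.

```python
def py2_style_encode(byte_string, encoding, default_encoding='ascii'):
    """Mimic Python 2's str.encode(): an implicit decode, then an encode.

    Hypothetical helper for illustration only.
    """
    as_unicode = byte_string.decode(default_encoding)   # the hidden step
    return as_unicode.encode(encoding)


# Pure ASCII bytes survive the round trip unchanged.
ok = py2_style_encode(b'abc', 'utf-8')

# Non-ASCII bytes fail in the hidden decode, before any encoding happens,
# which is why an encode call can raise a UnicodeDecodeError.
try:
    py2_style_encode(b'\xc4\x85', 'utf-8')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
```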
 
mk

MRAB said:
When working with Unicode in Python 2, you should use the 'unicode' type
for text (Unicode strings) and limit the 'str' type to binary data
(bytestrings, ie bytes) only.

Well OK, always use u'something', that's simple -- but isn't str what I
get from files and sockets and the like?
In Python 3 they've been renamed to 'str' for Unicode _strings_ and
'bytes' for binary data (bytes!).

Neat, except that the process of porting most projects and external
libraries to P3 seems to be, how should I put it, standing still? Or am
I wrong? But that's the impression I get?

Take web frameworks for example. Does any of them have serious plans and
work in place to port to P3?
Strictly speaking, only Unicode can be encoded.

How so? Can't bytestrings containing characters of, say, koi8r encoding
be encoded?
What Python 2 is doing here is trying to be helpful: if it's already a
bytestring then decode it first to Unicode and then re-encode it to a
bytestring.

It's really cumbersome sometimes, even if two libraries are written by
one author: for instance, Mako and SQLAlchemy are written by the same
guy. They are both top-of-the line in my humble opinion, but when you
connect them you get things like this:

1. you query SQLAlchemy object, that happens to have string fields in
relational DB.

2. Corresponding Python attributes of those objects then have type str,
not unicode.

3. then I pass those objects to Mako for HTML rendering.

Typically, it works: but if and only if a character in there does not
happen to be out of ASCII range. If it does, you get UnicodeDecodeError
on an unsuspecting user.

Sure, I wrote myself a helper that iterates over keyword dictionary to
make sure to convert all str to unicode and only then passes the
dictionary to render_unicode. It's an overhead, though. It would be
nicer to have it all unicode from db and then just pass it for rendering
and having it working. (unless there's something in filters that I
missed, but there's encoding of templates, tags, but I didn't find
anything on automatic conversion of objects passed to method rendering
template)
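
A helper of the kind described here might look roughly like the sketch below. It is not the actual code from this post: the name decode_values, the sample dictionary and the UTF-8 default are all assumptions.

```python
def decode_values(params, encoding='utf-8'):
    """Return a copy of params with every byte string decoded to unicode."""
    decoded = {}
    for key, value in params.items():
        # bytes is an alias of str on Python 2.6+, so this check works on
        # both Python 2 and Python 3.
        if isinstance(value, bytes):
            value = value.decode(encoding)
        decoded[key] = value
    return decoded


# Hypothetical row data: one encoded string field and one integer field.
context = decode_values({'name': b'Zo\xc3\xab', 'visits': 3})
```

The resulting dictionary holds only unicode text and non-string values, so it could then be handed to render_unicode as keyword arguments.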

But maybe I'm whining.

Unfortunately, the default encoding is ASCII, and the bytestring isn't
valid ASCII. Python 2 is being 'helpful' in a bad way!

And the default encoding is coded in such way so it cannot be changed in
sitecustomize (without code modification, that is).

Regards,
mk
 
Robert Kern

MRAB wrote:

How so? Can't bytestrings containing characters of, say, koi8r encoding
be encoded?

I think he means that only unicode objects can be encoded using the .encode()
method, as clarified by his next sentence:

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Steve Holden

mk said:
Well OK, always use u'something', that's simple -- but isn't str what I
get from files and sockets and the like?

Yes, which is why you need to know what encoding was used to create it.

mk said:
Neat, except that the process of porting most projects and external
libraries to P3 seems to be, how should I put it, standing still? Or am
I wrong? But that's the impression I get?

No, it's probably not going as quickly as you would like, but it's
certainly not standing still. Some of these libraries are substantial
works, and there were changes to the C API that take quite a bit of work
to adapt existing code to.

mk said:
Take web frameworks for example. Does any of them have serious plans and
work in place to port to P3?

There have already been demonstrations of partially-working Python 3
Django. I can't speak to the rest.

mk said:
How so? Can't bytestrings containing characters of, say, koi8r encoding
be encoded?

It's just terminology. If a bytestring contains koi8r characters then
(as you unconsciously recognized by your use of the word "encoding") it
already *has* been encoded.

mk said:
It's really cumbersome sometimes, even if two libraries are written by
one author: for instance, Mako and SQLAlchemy are written by the same
guy. They are both top-of-the line in my humble opinion, but when you
connect them you get things like this:

1. you query SQLAlchemy object, that happens to have string fields in
relational DB.

2. Corresponding Python attributes of those objects then have type str,
not unicode.

Yes, a relational database will often return ASCII, but nowadays people
are increasingly using encoded Unicode. In that case you need to be
aware of the encoding that has been used to render the Unicode values
into the byte strings (which in Python 2 are of type str) so that you
can decode them into Unicode.

mk said:
3. then I pass those objects to Mako for HTML rendering.

Typically, it works: but if and only if a character in there does not
happen to be out of ASCII range. If it does, you get UnicodeDecodeError
on an unsuspecting user.

Well first you need to be clear what you are passing to Mako.

mk said:
Sure, I wrote myself a helper that iterates over keyword dictionary to
make sure to convert all str to unicode and only then passes the
dictionary to render_unicode. It's an overhead, though. It would be
nicer to have it all unicode from db and then just pass it for rendering
and having it working. (unless there's something in filters that I
missed, but there's encoding of templates, tags, but I didn't find
anything on automatic conversion of objects passed to method rendering
template)

Some database modules will distinguish between fields of type varchar
and nvarchar, returning Unicode objects for the latter. You will need to
ensure that the module knows which encoding is used in the database.
This is usually automatic.

mk said:
But maybe I'm whining.

Nope, just struggling with a topic that is far from straightforward the
first time you encounter it.

mk said:
And the default encoding is coded in such way so it cannot be changed in
sitecustomize (without code modification, that is).

Yes, the default encoding is not always convenient.

regards
Steve
 
Terry Reedy

Neat, except that the process of porting most projects and external
libraries to P3 seems to be, how should I put it, standing still?

What is important are the libraries, so more new projects can start in
3.x. There is a slow trickle of 3.x support announcements.

But maybe I'm whining.

Or perhaps explaining why 3.x unicode improvements are needed.

tjr
 
Nobody

4. Consider switching to Python 3.x, since there is only one string
type (unicode).

However: one drawback of Python 3.x is that the repr() of a Unicode string
is no longer restricted to ASCII. There is an ascii() function which
behaves like the 2.x repr(). However: the interpreter uses repr() for
displaying the value of an expression typed at the interactive prompt,
which results in "can't encode" errors if the string cannot be converted
to your locale's encoding.
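
A short Python 3 only sketch of the difference between the two built-ins. The sample string is arbitrary.

```python
# Python 3 only: ascii() is a built-in there.
greek = '\u03b1\u03b2\u03b3'

safe = ascii(greek)    # escapes every non-ASCII character
plain = repr(greek)    # keeps printable non-ASCII characters as they are

# The ascii() result can always be written to an ASCII-only console.
is_ascii_only = all(ord(ch) < 128 for ch in safe)
```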
 
John Nagle

Bear in mind that most Python implementations assume the "console"
only handles ASCII. So "print" output is converted to ASCII, which
can fail. (Actually, all modern Windows and Linux systems support
Unicode consoles, but Python somehow doesn't get this.)

John Nagle
 
John Nagle

kj said:
Some people have mathphobia. I'm developing a wicked case of
Unicodephobia.

I have read a *ton* of stuff on Unicode. It doesn't even seem all
that hard. Or so I think. Then I start writing code, and WHAM:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

First, you haven't told us what platform you're on. Windows? Linux?
Something else?

If you're on Windows, and running Python from the command line, try
"cmd /u" before running Python. This will get you a Windows console that
will print Unicode. Python recognizes this, and "print" calls will
go out to the console in Unicode, which will then print the correct
characters if they're in the font being used by the Windows console.
Most European languages are covered in the standard font.

If you're using IDLE, or some Python debugger, it may need to be
told to have its window use Unicode.

John Nagle
 
