unicode encoding usability problem

A

aurora

I have long found the Python default encoding of strict ASCII frustrating.
For one thing, I prefer to get garbage characters rather than an exception. But the
biggest issue is that Unicode exceptions often pop up in unexpected places, and
only when a non-ASCII or unicode character first finds its way into the
system.

Below is an example. The program may run fine at the beginning. But as
soon as a unicode string u'b' is introduced, the program blows up
unexpectedly.
>>> a = '\xe5'
>>> b = u'b'
>>> print a
å
>>> a == b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
ordinal not in range(128)

One may suggest that the correct way to do it is to use decode, such as

a.decode('latin-1') == b
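A minimal sketch of why the explicit decode helps, written in modern Python 3 syntax (where the byte string above would be spelled b'\xe5'); the variable names mirror the example and the latin-1 choice follows the post:

```python
a = b'\xe5'   # the latin-1 byte for 'å'; in Python 2 this was the str 'å'
b = u'b'      # a unicode string (the u'' prefix is still accepted)

# Decoding first makes both operands unicode, so the comparison is
# well-defined: it simply evaluates to False instead of raising.
result = (a.decode('latin-1') == b)
```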


This brings up another issue. Most references and books focus exclusively on
entering unicode literals and using the encode/decode methods. The fallacy
is that string is such a basic data type, used throughout the program, that you
really don't want to make an individual decision every time you use a
string (and take a penalty for any negligence). Java has a much more
usable model, with unicode used internally and encoding/decoding decisions
needed only twice, when dealing with input and output.

I am sure these errors are a nuisance to those who are only half conscious of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure their programs work correctly.
 
F

Fredrik Lundh

anonymous coward said:
This brings up another issue. Most references and books focus exclusively on entering unicode
literals and using the encode/decode methods. The fallacy is that string is such a basic data type,
used throughout the program, that you really don't want to make an individual decision every time
you use a string (and take a penalty for any negligence). Java has a much more usable model,
with unicode used internally and encoding/decoding decisions needed only twice, when dealing with
input and output.

that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.
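That pattern can be sketched as follows; a minimal sketch in modern Python 3 syntax (where str is the unicode type and bytes the binary type), with latin-1 and utf-8 as illustrative codec choices:

```python
raw = b'caf\xe9'              # bytes as they arrive from a file or socket

# Boundary 1: decode immediately on the way in.
text = raw.decode('latin-1')  # a unicode string from here on

# The middle of the program touches unicode only.
shouted = text.upper()

# Boundary 2: encode only on the way out.
out = shouted.encode('utf-8') # bytes again, ready for I/O
```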

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.
Even for those who choose to use unicode, it is almost impossible to ensure their programs work
correctly.

well, if you use unicode the way it was intended to, it just works.

</F>
 
A

aurora

that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.

I don't want to mix them. But how could I find them? How do I know this
statement can be a potential problem:

if a==b:

where a and b can be instantiated individually, far away from this line of
code that puts them together?

In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by promoting
binary strings to unicode. Things work fine, unit tests pass, all until
the first non-ASCII character comes in and then the program breaks.

Is there a scheme for Python developers to use so that they are safe from
incorrect mixing?
 
W

Walter Dörwald

aurora said:
> [...]
In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by
promoting binary strings to unicode. Things work fine, unit tests pass,
all until the first non-ASCII character comes in and then the program
breaks.

Is there a scheme for Python developers to use so that they are safe
from incorrect mixing?

Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.
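A minimal sketch of the strictness this setting enforces, shown in modern Python 3 (where sys.setdefaultencoding no longer exists because refusing implicit str/bytes conversion became the default):

```python
# Mixing the two string types now raises immediately, which is exactly
# the failure mode the "undefined" default encoding provokes on purpose.
try:
    'abc' + b'def'            # implicit mixing of unicode and bytes
    caught = None
except TypeError as exc:
    caught = exc
```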

HTH,
Walter Dörwald
 
J

Jarek Zgoda

Fredrik Lundh wrote:
that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

There are implementations of Python where it isn't so easy; Python for
iSeries (http://www.iseriespython.com/) is one of them. Code written
for a "normal" platform doesn't work on AS/400, even if all strings used
internally are unicode objects; unicode literals also don't work as
expected.

Of course, this is an implementation fault, but it makes for a headache if you
need to write portable code.
 
M

Martin v. Löwis

aurora said:
Java has a much more usable model, with unicode used internally and
encoding/decoding decisions needed only twice, when dealing with input and
output.

In addition to Fredrik's comment (that you should use the same model
in Python) and Walter's comment (that you can enforce it by setting
the default encoding to "undefined"), I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows Unicode support to be added
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Regards,
Martin
 
T

Thomas Heller

Walter Dörwald said:
aurora said:
[...]
In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by
promoting binary strings to unicode. Things work fine, unit tests
pass, all until the first non-ASCII character comes in and then the
program breaks.
Is there a scheme for Python developers to use so that they are safe
from incorrect mixing?

Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.

Sounds cool, so I did it.
And started a program I was currently working on.
The first function in it is this:

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + r"\bin"

    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc

All strings in that snippet are text strings, so the first approach was
to convert them to unicode literals. Doesn't work. Here is the final,
working version (changes are marked):

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + ur"\bin"
                    #--------------------------------------------------^
    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc.encode("mbcs")
        #----------------------------^

So, it appears that:

- the _winreg.QueryValueEx function is strange: it takes ascii strings,
but returns a unicode string.
- _winreg.OpenKey takes ascii strings.
- os.environ["PATH"] accepts an ascii string.

And I won't guess what happens when there are registry entries with
umlauts (ok, they could be handled by the 'mbcs' encoding), or with Chinese
or Japanese characters (no way to represent them in ascii strings with a
western locale and mbcs encoding, afaik).


I suggest that 'sys.setdefaultencoding("undefined")' be the standard
setting for the core developers ;-)

Thomas
 
J

Jarek Zgoda

Walter Dörwald wrote:
Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.

This will help in your code, but there is a big pile of modules in the stdlib
that are not unicode-friendly. From my daily practice come shlex
(the tokenizer works only with encoded strings) and logging (you can't
specify an encoding for FileHandler).
 
T

Thomas Heller

Martin v. Löwis said:
We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows Unicode support to be added
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Is it possible to specify a byte string literal when running with the -U option?

Thomas
 
T

Thomas Heller

Martin v. Löwis said:
Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Not very far - can't even call functions ;-)

c:\>py -U
Python 2.5a0 (#60, Dec 29 2004, 11:27:13) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(**kw): pass
...
>>> f(a=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings

Thomas
 
N

Neil Hodgson

Martin v. Löwis:
Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Tried both -U and sys.setdefaultencoding("undefined") on a couple of my
most used programs and saw a few library problems. One program reads job
advertisements from a mailing list, ranks them according to keywords, and
displays them using unicode to ensure that HTML entities like &bull; are
displayed correctly. That program worked without changes.

The second program reads my spam-filled mail box, removing messages that
match a set of header criteria. It uses decode_header and make_header from
the email.Header library module to convert each header from a set of encoded
strings into a single unicode string. As email.Header is strongly concerned
with unicode, I expected it would be able to handle the two modifications
well.

With -U, there was one bug in my code assuming that a string would be 8
bit, and that was easily fixed. In email.Charset, __init__ expects a
non-unicode argument, as it immediately calls unicode(input_charset, 'ascii'),
which fails when the argument is unicode. This can be fixed explicitly in
the __init__, but I would argue for a more lenient approach, with unicode(u,
enc, err) always ignoring the enc and err arguments when the input is
already unicode. Next, sre breaks when building a mapping array, because
array.array cannot have a unicode type code. This should probably be fixed
in array rather than sre, as mapping = array.array('b'.encode('ascii'),
mapping).tostring() is too ugly. The final issue was in encodings.idna, where
there is ace_prefix = "xn--"; uace_prefix = unicode(ace_prefix, "ascii"),
which again could avoid breakage if unicode were more lenient.

With sys.setdefaultencoding("undefined"), there were more problems, and
they were harder to work around. One addition that could help would be a
function similar to str but with an optional encoding that would be used
when the input failed to convert to a string because of a UnicodeError.
Something like

def stri(x, enc='us-ascii'):
    try:
        return str(x)
    except UnicodeError:
        return unicode(x).encode(enc)

Neil
 
A

aurora

aurora said:
[...]
In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by
promoting binary strings to unicode. Things work fine, unit tests pass,
all until the first non-ASCII character comes in and then the program
breaks.
Is there a scheme for Python developers to use so that they are safe
from incorrect mixing?

Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.

HTH,
Walter Dörwald

That helps! Running the unit tests caught quite a few potential problems (as
well as a lot of safe ASCII string promotions).
 
A

aurora

I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows Unicode support to be added
to libraries as the need arises. This transition is still in
progress.

I understand. So I wasn't yelling "why can't Python be more like Java". On
the other hand, I also want to point out that making an individual decision for
each string isn't practical and is very error prone. The fact that
unicode and 8-bit strings look alike and work alike in common situations, but
only run into problems with non-ASCII, is very confusing for most people.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Lots of errors. Among them are gzip (binary?!) and strftime??

I actually quite appreciate Python's power in processing binary data as
8-bit strings. But perhaps we should transition to using unicode as the text
string and treat binary strings as the exception. Right now we have

'' - 8-bit string; u'' - unicode string

How about

b'' - 8-bit string; '' - unicode string

and no automatic conversion. Perhaps this could be activated by something
like the encoding declarations, so that the transition can happen module by
module.
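For what it's worth, this proposal closely matches the model that Python 3 eventually adopted: '' is unicode, b'' is a byte string, and there is no automatic conversion. A minimal sketch in that syntax:

```python
s = 'abc'                     # unicode text by default
b = b'abc'                    # an explicit byte string

equal = (s == b)              # False: never implicitly promoted and compared
try:
    s + b                     # mixing raises instead of silently converting
    mix_error = None
except TypeError as exc:
    mix_error = exc
```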
 
A

Alexander Schremmer

Thomas Heller said:
Not very far - can't even call functions ;-)

... pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings

That is possible: WFM. SCNR,
Alexander
 
F

Fredrik Lundh

aurora said:
I don't want to mix them. But how could I find them? How do I know this statement can be a
potential problem:

if a==b:

where a and b can be instantiated individually, far away from this line of code that puts them
together?

if you don't know where a and b come from, how can you be sure that
your program works at all? how can you be sure they're both strings?

("a op b" can fail in many ways, depending on what "a", "b", and "op" are)
Things work fine, unit tests pass, all until the first non-ASCII character
comes in and then the program breaks.

if you have unit tests, why don't they include Unicode tests?

</F>
 
N

Nick Coghlan

Thomas said:
Is it possible to specify a byte string literal when running with the -U option?

Not that I know of. If the 'bytes' type happens, then I'd be a fan of b"" to get
a byte string instead of a character string.

Cheers,
Nick.
 
M

Martin v. Löwis

Thomas said:
Is it possible to specify a byte string literal when running with the -U option?

Not literally. However, you can specify things like

bytes = [0x47, 0x49, 0x4f, 0x50, 0x01, 0x00]
bytes = ''.join((chr(x) for x in bytes))

Alternatively, you could rely on the 1:1 feature of Latin-1:

bytes = "GIOP\x01\0".encode("l1")
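Both constructions can be checked in modern Python 3, where the direct bytes() constructor (not yet available at the time) does the same job as the join-of-chr idiom:

```python
data = [0x47, 0x49, 0x4f, 0x50, 0x01, 0x00]

# The join-of-chr approach, transliterated to the bytes() constructor.
joined = bytes(data)

# The latin-1 ("l1") trick also still works, since latin-1 maps code
# points 0-255 one-to-one onto bytes.
encoded = "GIOP\x01\0".encode("l1")
```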

Regards,
Martin
 
M

Martin v. Löwis

aurora said:
Lots of errors. Among them are gzip (binary?!) and strftime??

For gzip, this is not surprising. It contains things like

self.fileobj.write('\037\213')

which is not intended to denote characters.

How about

b'' - 8-bit string; '' - unicode string

and no automatic conversion.

This has been proposed before, see PEP 332. The problem is that
people often want byte strings to be mutable as well, so it is
still unclear whether it is better to make the b prefix denote
the current string type (so it would be currently redundant)
or a newly-created mutable string type (similar to array.array).
Perhaps this could be activated by something
like the encoding declarations, so that the transition can happen module by
module.

That could work for the literals - a __future__ import would be
most appropriate. For "no automatic conversion", this is very
difficult to implement on a per-module basis. The errors typically
don't occur in the module itself, but in some function called by
the module (e.g. a builtin method of the string type). So the
callee would have to know whether the caller has a future
import...

Regards,
Martin
 
N

Nick Coghlan

Martin said:
This has been proposed before, see PEP 332. The problem is that
people often want byte strings to be mutable as well, so it is
still unclear whether it is better to make the b prefix denote
the current string type (so it would be currently redundant)
or a newly-created mutable string type (similar to array.array).

Having "", u"", and r"" be immutable, while b"" was mutable would seem rather
inconsistent.

If you want a phased migration to 'assert (str is unicode) == True', then PEP
332 seems to have that covered:

1. Introduce 'bytes' as an alias of str
2. Introduce b"" as an alternate spelling of r""
3. Switch str to be an alias of unicode
4. Switch "" to be an alternate spelling of u""

Trying to intermingle this with making the bytes type mutable seems to be
begging for trouble - consider how many string-keyed dictionaries would break
with that change (the upgrade path is non-existent - you can't stay with str,
because you want byte strings, but you can't go to bytes, because you need
something immutable).

An alternative would be to have "bytestr" be the immutable type corresponding to
the current str (with b"" literals producing bytestr's), while reserving the
"bytes" name for a mutable byte sequence. That is, change PEP 332's upgrade path
to look more like:

* Add a bytestr builtin which is just a synonym for str. (2.5)
* Add a b"..." string literal which is equivalent to raw string literals,
with the exception that values which conflict with the source encoding of the
containing file do not generate warnings. (2.5)
* Warn about the use of variables named "bytestr". (2.5 or 2.6)
* Introduce a bytestr builtin which refers to a sequence distinct from the
str type. (2.6)
* Make str a synonym for unicode. (3.0)

And separately:
* Introduce a bytes builtin which is a mutable byte sequence

Alternately, add array.bytes as a subclass of array.array, that provides a nicer
API for dealing specifically with byte strings.

The main point being, the replacement for 'str' needs to be immutable or the
upgrade process is going to be a serious PITA.

Cheers,
Nick.
 
?

=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=

Nick said:
Having "", u"", and r"" be immutable, while b"" was mutable would seem
rather inconsistent.

Yes. However, this inconsistency might be desirable. It would, of
course, mean that the literal cannot be a singleton. Instead, it has
to be a display (?), similar to list or dict displays: each execution
of the byte string literal creates a new object.
An alternative would be to have "bytestr" be the immutable type
corresponding to the current str (with b"" literals producing
bytestr's), while reserving the "bytes" name for a mutable byte
sequence.

Indeed. This maze of options has caused the process to get stuck.
People also argue that with such an approach we could just as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.
The main point being, the replacement for 'str' needs to be immutable or
the upgrade process is going to be a serious PITA.

Somebody really needs to take this in hand: completing the PEP,
writing a patch, and checking applications to find out what breaks.

Regards,
Martin
 
