Unicode and dictionaries

gizli · Jan 16, 2010

Hi all,

I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I
ran into this issue yesterday and wanted to check whether this is a
Python bug. It seems that there is an inconsistency between lists and
dictionaries in the way that unicode objects are handled. Take a look
at the following example:

>>> test_dict = {u'öğe':1}
>>> u'öğe' in test_dict.keys()
True
>>> 'öğe' in test_dict.keys()
True
>>> test_dict[u'öğe']
1
>>> test_dict['öğe']
Traceback (most recent call last):

Is this a bug? has_key functionality of the dictionary works as
expected:
False

Steven D'Aprano · Jan 16, 2010

> Hi all,
>
> I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
> into this issue yesterday and wanted to check to see if this is a
> python bug. It seems that there is an inconsistency between lists and
> dictionaries in the way that unicode objects are handled. Take a look at
> the following example:
>
> True

I can't reproduce your result, at least not in 2.6.1:
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
False

Carl Banks · Jan 17, 2010

> I can't reproduce your result, at least not in 2.6.1:
>
> __main__:1: UnicodeWarning: Unicode equal comparison failed to convert
> both arguments to Unicode - interpreting them as being unequal
> False

The OP changed his default encoding. I was able to confirm the
behavior after setting the default encoding to latin-1.

This is most definitely a bug in Python.

Carl Banks

Carl Banks · Jan 17, 2010

> I would call this a bug. The two objects are different, so the latter
> expression should return 'False'.

Except the two objects are not different if default encoding is utf-8.

(Whether it's a good idea to change the default encoding is another
question, but Python is clearly documented as behaving this way. When
comparing a byte string and a Unicode string, the byte string will be
decoded according to the default encoding.)

> FYI, 'foo in bar.keys()' is easier to spell as 'foo in bar'.

I believe the OP's point was to show that dicts behave differently
than lists here ("in" works for lists, doesn't work for dicts).

Carl Banks

Carl Banks · Jan 17, 2010

> The OP changed his default encoding. I was able to confirm the
> behavior after setting the default encoding to latin-1.
>
> This is most definitely a bug in Python.

I've thought it over and I'm not so sure it's a bug now, but it is
highly questionable. Here is a more detailed explanation. The
following session shows why; my terminal is UTF-8.

Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import sys
>>> reload(sys) # get sys.setdefaultencoding back
>>> sys.setdefaultencoding('utf-8')
>>> u'öğe' == 'öğe'
True
>>> test_dict = {u'öğe':1}
>>> test_dict['öğe']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '\xc3\xb6\xc4\x9fe'

So the source encoding is UTF-8, and you see I've set the default
encoding to UTF-8. You'll notice that u'öğe' and 'öğe' compare equal;
this is entirely correct. Given that UTF-8 is the source encoding,
the string 'öğe' will be read as a byte string holding the UTF-8 encoding
of those Unicode characters. And, given that UTF-8 is also the
default encoding, the byte string will be decoded using UTF-8 for the
comparison, and so will be equal to the Unicode string.
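For readers following along on a current interpreter, the byte values in the traceback above can be checked directly. This is only a Python 3 sketch (the thread itself concerns Python 2, where the decoding happened implicitly); the variable names are illustrative and not from the original post.

```python
# Python 3 sketch: Python 2's u'...' literal corresponds to str here,
# and Python 2's plain '...' literal corresponds to bytes.
text = "\u00f6\u011fe"            # the Unicode string öğe
raw = text.encode("utf-8")        # what a UTF-8 terminal puts into a plain '...' literal

print(repr(raw))                  # the bytes named in the KeyError above

# Decoding with the same codec round-trips to the original text ...
same_codec = raw.decode("utf-8")
# ... while decoding with a different codec yields different characters.
other_codec = raw.decode("latin-1")

print(same_codec == text)
print(other_codec == text)
```

The equality only holds when the codec used for decoding matches the one that produced the bytes, which is why the result depends on both the terminal encoding and the default encoding.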

Given that the two are equal, the correct behavior for dicts would be
to use the two as the same key. However, it doesn't. In fact the two
objects don't even have the same hash code:
-813744964

This ought to be a bug; objects that compare equal and are hashable
must have the same hash code. However, given that it is crucially
important to be as fast as possible when calculating the hash code of
ASCII strings, I could imagine that this is deliberate. (And if it is,
it should be documented as such; I looked briefly but did not see it.)
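The broken invariant can be imitated on Python 3 with a small class. This is a sketch under stated assumptions, not how Python 2 was implemented: `Utf8Bytes` is a hypothetical stand-in for a Python 2 byte string under a UTF-8 default encoding, comparing equal to the decoded text while hashing its raw bytes.

```python
class Utf8Bytes:
    """Hypothetical stand-in for a Python 2 byte string (UTF-8 default encoding)."""

    def __init__(self, raw):
        self.raw = raw

    def __eq__(self, other):
        # Equality decodes the bytes first, imitating the implicit coercion.
        if isinstance(other, str):
            return self.raw.decode("utf-8") == other
        if isinstance(other, Utf8Bytes):
            return self.raw == other.raw
        return NotImplemented

    def __hash__(self):
        # The hash is taken over the raw bytes, not the decoded text, so for
        # non-ASCII content it will in practice differ from hash() of the str.
        return hash(self.raw)


text = "\u00f6\u011fe"                    # öğe
key = Utf8Bytes(text.encode("utf-8"))

equal = (key == text)                     # equality says "same"
in_list = key in [text]                   # list membership relies on == only
in_dict = key in {text: 1}                # dict membership consults the hash first
print(equal, in_list, in_dict)
```

Barring an accidental hash collision, the list finds the key and the dict does not, which is the same split reported at the top of the thread.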

I can imagine another buggy possibility as well. test_dict['öğe'] = 2
will add a new key to the above example, but it could overwrite the
key if there's a hash collision, because the objects compare equal.

All in all, it's a mighty mess. The best advice is to avoid it
altogether and leave the default encoding alone.

Thankfully Python 3 does away with all this nonsense.
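As a rough illustration of that last point, here is how the same experiment looks on Python 3, where `str` and `bytes` are never coerced into each other. The variable names are illustrative only.

```python
import sys

text = "\u00f6\u011fe"            # öğe as a Unicode string
raw = text.encode("utf-8")        # the same characters as UTF-8 bytes

# Python 3 never decodes bytes implicitly, so the two compare unequal.
# (Running the interpreter with -b would additionally emit a BytesWarning.)
compares_equal = (text == raw)

test_dict = {text: 1}
found_by_text = text in test_dict
found_by_bytes = raw in test_dict

# The hook used in the session above no longer exists on the sys module.
has_setter = hasattr(sys, "setdefaultencoding")

print(compares_equal, found_by_text, found_by_bytes, has_setter)
```

Lists and dicts now agree with each other: the bytes object is a different key everywhere, regardless of any encoding setting.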

Carl Banks

Carl Banks · Jan 17, 2010

> They are different, because a Unicode object is *not* encoded in any
> character encoding, whereas the byte string object is.

Of course they're different, but that's not relevant to this situation.
What matters is whether they compare equal, which is the only criterion for
whether an object is found in a list. x in s is true if there is some
object m in s for which m == x.
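The list rule just described can be written out as a short function. This is a sketch of the semantics, not CPython's actual C code; `contains` is a made-up name, and the identity check reflects the shortcut the language reference describes for built-in sequences.

```python
def contains(seq, x):
    """Rough equivalent of `x in seq` for a list: identity or equality, no hashing."""
    return any(m is x or m == x for m in seq)


# 2 and 2.0 are different objects of different types, but they compare equal,
# so membership succeeds.
print(contains([1, 2, 3], 2.0))
print(2.0 in [1, 2, 3])

# Nothing in the list compares equal to "c".
print(contains(["a", "b"], "c"))
```

No hash value is involved at any point, which is why a list keeps working even when two equal objects hash differently.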

If the default encoding and the terminal encoding are both UTF-8 (or
both latin-9), then u'öğe' == 'öğe'. This behavior is documented (PEP
100) and therefore not a bug. Relevant lines:

"Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as Unicode string using the <default
encoding>."

Carl Banks

gizli · Jan 17, 2010

Thanks to all of you. This once again proves how deep you can get
yourself into a mess if you mix unicode and string objects in your
code!

Martin v. Loewis · Jan 17, 2010

> This ought to be a bug; objects that compare equal and are hashable
> must have the same hash code.

It's not a bug. Changing the default encoding is not really supported,
let alone changing it to anything but latin-1, precisely for the reasons
you discuss.

If you do change the default encoding, Python *will* break. This has
been discussed many times, but some people still think they know better.

Regards,
Martin

Martin v. Loewis · Jan 17, 2010

> Thanks to all of you. This once again proves how deep you can get
> yourself into a mess if you mix unicode and string objects in your
> code!

The specific issue is that you apparently changed the default encoding.
Don't do that; Python will break if you do.

Regards,
Martin

Steven D'Aprano · Jan 17, 2010

> It's not a bug. Changing the default encoding is not really supported,
> let alone changing it to anything but latin-1, precisely for the reasons
> you discuss.
>
> If you do change the default encoding, Python *will* break. This has
> been discussed many times, but some people still think they know better.

That's specific to CPython though, isn't it? Other implementations may,
or may not, cope with it better?

Martin v. Loewis · Jan 17, 2010

>> This ought to be a bug; objects that compare equal and are hashable
>
> That's specific to CPython though, isn't it? Other implementations may,
> or may not, cope with it better?

No, that's fairly inherent to the problem. The problems might go away
only if that other implementation doesn't use hashing for dictionaries.
However, this is fairly unlikely - in particular, since the language
spec nearly mandates that dictionaries are hash-based
(rather than relying on comparability).
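To see why any hash-based mapping shares the problem, consider a toy table. This is a deliberately simplified sketch, not CPython's dict: `TinyHashMap` and `BadKey` are hypothetical names, and the table uses fixed buckets with chaining.

```python
class TinyHashMap:
    """Toy hash table (not CPython's dict): fixed buckets, chained entries."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash alone decides which bucket is searched.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Equality is only tested against entries in the chosen bucket.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default


class BadKey:
    """Hypothetical key: equal to any int, but always hashes to 0."""

    def __eq__(self, other):
        return isinstance(other, int)

    def __hash__(self):
        return 0


table = TinyHashMap()
table.set(1, "one")                 # hash(1) == 1, stored in bucket 1
miss = table.get(BadKey())          # searches bucket 0, never sees the key 1
table.set(8, "eight")               # hash(8) % 8 == 0, stored in bucket 0
hit = table.get(BadKey())           # same bucket now, and 8 == BadKey() is true
print(miss, hit)
```

The lookup misses an equal key when the hashes point at different buckets and finds one only when they happen to collide, so the outcome depends on the hash rather than on equality alone.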

Regards,
Martin
