Unicode and dictionaries

Discussion in 'Python' started by gizli, Jan 16, 2010.

  1. gizli

    gizli Guest

    Hi all,

    I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I
    ran into this issue yesterday and wanted to check to see if this is a
    python bug. It seems that there is an inconsistency between lists and
    dictionaries in the way that unicode objects are handled. Take a look
    at the following example:

    >>> test_dict = {u'öğe':1}
    >>> u'öğe' in test_dict.keys()
    True
    >>> 'öğe' in test_dict.keys()
    True
    >>> test_dict[u'öğe']
    1
    >>> test_dict['öğe']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: '\xc3\xb6\xc4\x9fe'
    >>>


    Is this a bug? has_key functionality of the dictionary works as
    expected:

    >>> test_dict.has_key(u'öğe')
    True
    >>> test_dict.has_key('öğe')
    False
     
    gizli, Jan 16, 2010
    #1

  2. On Sat, 16 Jan 2010 15:35:05 -0800, gizli wrote:

    > Hi all,
    >
    > I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
    > into this issue yesterday and wanted to check to see if this is a
    > python bug. It seems that there is an inconsistency between lists and
    > dictionaries in the way that unicode objects are handled. Take a look at
    > the following example:
    >
    > >>> test_dict = {u'öğe':1}
    > >>> u'öğe' in test_dict.keys()
    > True
    > >>> 'öğe' in test_dict.keys()
    > True



    I can't reproduce your result, at least not in 2.6.1:

    >>> test_dict = {u'öğe':1}
    >>> u'öğe' in test_dict.keys()
    True
    >>> 'öğe' in test_dict.keys()
    __main__:1: UnicodeWarning: Unicode equal comparison failed to convert
    both arguments to Unicode - interpreting them as being unequal
    False



    --
    Steven
     
    Steven D'Aprano, Jan 16, 2010
    #2

  3. Carl Banks

    Carl Banks Guest

    On Jan 16, 3:58 pm, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Sat, 16 Jan 2010 15:35:05 -0800, gizli wrote:
    > > Hi all,
    > >
    > > I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
    > > into this issue yesterday and wanted to check to see if this is a
    > > python bug. It seems that there is an inconsistency between lists and
    > > dictionaries in the way that unicode objects are handled. Take a look at
    > > the following example:
    > >
    > > >>> test_dict = {u'öğe':1}
    > > >>> u'öğe' in test_dict.keys()
    > > True
    > > >>> 'öğe' in test_dict.keys()
    > > True
    >
    > I can't reproduce your result, at least not in 2.6.1:
    >
    > >>> test_dict = {u'öğe':1}
    > >>> u'öğe' in test_dict.keys()
    > True
    > >>> 'öğe' in test_dict.keys()
    > __main__:1: UnicodeWarning: Unicode equal comparison failed to convert
    > both arguments to Unicode - interpreting them as being unequal
    > False



    The OP changed his default encoding. I was able to confirm the
    behavior after setting the default encoding to latin-1.

    This is most definitely a bug in Python.


    Carl Banks
     
    Carl Banks, Jan 17, 2010
    #3
  4. Carl Banks

    Carl Banks Guest

    On Jan 16, 3:56 pm, Ben Finney wrote:
    > gizli writes:
    > > >>> test_dict = {u'öğe':1}
    > > >>> u'öğe' in test_dict.keys()
    > > True
    > > >>> 'öğe' in test_dict.keys()
    > > True
    >
    > I would call this a bug. The two objects are different, so the latter
    > expression should return ‘False’.


    Except the two objects are not different if default encoding is utf-8.

    (Whether it's a good idea to change the default encoding is another
    question, but Python is clearly documented as behaving this way. When
    comparing a byte string and a Unicode string, the byte string will be
    decoded according to the default encoding.)
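
    A rough sketch of that comparison rule, written for Python 3 so it can
    be run today. The helper name py2_style_equal and its default_encoding
    parameter are made up for illustration; the real Python 2 comparison is
    implemented inside the interpreter and also emits a UnicodeWarning when
    decoding fails.

```python
def py2_style_equal(text, raw, default_encoding="ascii"):
    """Emulate how Python 2 compared a unicode object with a byte string.

    Hypothetical helper: the byte string is decoded with the default
    encoding first; if decoding fails, the two count as unequal.
    """
    try:
        return text == raw.decode(default_encoding)
    except UnicodeDecodeError:
        return False


text = "\u00f6\u011fe"          # the text 'öğe'
raw = text.encode("utf-8")      # the bytes a UTF-8 terminal sends for 'öğe'

print(py2_style_equal(text, raw, "ascii"))   # False: stock default encoding
print(py2_style_equal(text, raw, "utf-8"))   # True: default encoding changed
```

    With the stock ASCII default the decode fails and the objects are
    unequal, which matches the UnicodeWarning result quoted earlier in the
    thread.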


    > FYI, ‘foo in bar.keys()’ is easier to spell as ‘foo in bar’.


    I believe the OP's point was to show that dicts behave differently
    than lists here ("in" works for lists, doesn't work for dicts).


    Carl Banks
     
    Carl Banks, Jan 17, 2010
    #4
  5. Carl Banks

    Carl Banks Guest

    On Jan 16, 5:38 pm, Carl Banks wrote:
    > On Jan 16, 3:58 pm, Steven D'Aprano <st...@REMOVE-THIS-
    > cybersource.com.au> wrote:
    > > On Sat, 16 Jan 2010 15:35:05 -0800, gizli wrote:
    > > > Hi all,
    > > >
    > > > I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
    > > > into this issue yesterday and wanted to check to see if this is a
    > > > python bug. It seems that there is an inconsistency between lists and
    > > > dictionaries in the way that unicode objects are handled. Take a look at
    > > > the following example:
    > > >
    > > > >>> test_dict = {u'öğe':1}
    > > > >>> u'öğe' in test_dict.keys()
    > > > True
    > > > >>> 'öğe' in test_dict.keys()
    > > > True
    > >
    > > I can't reproduce your result, at least not in 2.6.1:
    > >
    > > >>> test_dict = {u'öğe':1}
    > > >>> u'öğe' in test_dict.keys()
    > > True
    > > >>> 'öğe' in test_dict.keys()
    > > __main__:1: UnicodeWarning: Unicode equal comparison failed to convert
    > > both arguments to Unicode - interpreting them as being unequal
    > > False
    >
    > The OP changed his default encoding. I was able to confirm the
    > behavior after setting the default encoding to latin-1.
    >
    > This is most definitely a bug in Python.


    I've thought it over and I'm not so sure it's a bug now, but it is
    highly questionable. Here is a more detailed explanation. The
    following session shows why; my terminal is UTF-8.


    Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> reload(sys) # get sys.setdefaultencoding back
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf-8')
    >>> u'öğe' == 'öğe'
    True
    >>> test_dict = {u'öğe':1}
    >>> test_dict['öğe']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: '\xc3\xb6\xc4\x9fe'


    So the source encoding is UTF-8, and you see I've set the default
    encoding to UTF-8. You'll notice that u'öğe' and 'öğe' compare equal;
    this is entirely correct. Given that UTF-8 is the source encoding,
    the string 'öğe' will be read as a byte string holding the UTF-8
    encoding of those Unicode characters. And, given that UTF-8 is also
    the default encoding, the byte string will be decoded using UTF-8 for
    the comparison, and so will be equal to the Unicode string.

    Given that the two are equal, the correct behavior for dicts would be
    to treat the two as the same key. However, dicts don't. In fact the
    two objects don't even have the same hash code:

    >>> hash(u'öğe')
    1671320785
    >>> hash('öğe')
    -813744964

    This ought to be a bug; objects that compare equal and are hashable
    must have the same hash code. However, given that it is crucially
    important to be as fast as possible when calculating the hash code of
    ASCII strings, I could imagine that this is deliberate. (And if it is,
    it should be documented as such; I looked briefly but did not see it.)
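
    The same list-versus-dict split can be reproduced in any Python
    version with a key type that deliberately breaks the rule. The class
    LooseKey below is hypothetical and exists only for this illustration;
    it is a minimal sketch, not how the string types are implemented.

```python
class LooseKey:
    """Hypothetical key type: equal objects may carry unequal hashes."""

    def __init__(self, label, hash_value):
        self.label = label
        self.hash_value = hash_value

    def __eq__(self, other):
        return isinstance(other, LooseKey) and self.label == other.label

    def __hash__(self):
        return self.hash_value


stored = LooseKey("item", 1)
probe = LooseKey("item", 2)      # equal to `stored`, but hashes differently

table = {stored: 1}

print(probe == stored)           # True
print(probe in list(table))      # True: list membership only uses ==
print(probe in table)            # False: the dict consults the hash first
```

    The dict never gets as far as calling == on the stored key, because
    the hash of the probe already points it elsewhere.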

    I can imagine another buggy possibility as well. test_dict['öğe'] = 2
    will add a new key to the above example, but it could overwrite the
    key if there's a hash collision, because the objects compare equal.
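
    A sketch of that overwrite scenario, assuming the two full hash values
    coincide. FixedHashKey is a hypothetical class whose instances all
    compare equal and whose hash is chosen by the caller; it stands in for
    the unicode/byte-string pair and is not taken from the thread.

```python
class FixedHashKey:
    """Hypothetical key: caller picks the hash, all instances compare equal."""

    def __init__(self, hash_value):
        self.hash_value = hash_value

    def __eq__(self, other):
        return isinstance(other, FixedHashKey)

    def __hash__(self):
        return self.hash_value


table = {FixedHashKey(1): "original"}

table[FixedHashKey(2)] = "different hash"   # hashes differ: stored as a new key
size_after_distinct_hash = len(table)

table[FixedHashKey(1)] = "same hash"        # same hash and ==: value overwritten
size_after_same_hash = len(table)

print(size_after_distinct_hash, size_after_same_hash)   # 2 2
print(sorted(table.values()))   # ['different hash', 'same hash']
```

    So whether an "equal" key adds an entry or replaces one depends purely
    on the hash values, which is what makes the behavior unpredictable.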

    All in all, it's a mighty mess. The best advice is to avoid it
    altogether and leave the default encoding alone.

    Thankfully Python 3 does away with all this nonsense.
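
    For comparison, a short Python 3 sketch of the original example. It
    assumes the interpreter is run without the -b/-bb flags, which turn
    bytes/str comparisons into warnings or errors.

```python
text = "\u00f6\u011fe"            # 'öğe' as a Python 3 str
raw = text.encode("utf-8")        # the same text as UTF-8 bytes

table = {text: 1}

print(raw == text)                    # False: bytes and str never compare equal
print(raw in list(table))             # False
print(raw in table)                   # False
print(raw.decode("utf-8") in table)   # True once decoded explicitly
```

    Lists and dicts now agree, because there is no implicit decoding step
    that could make the two objects equal in the first place.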


    Carl Banks
     
    Carl Banks, Jan 17, 2010
    #5
  6. Carl Banks

    Carl Banks Guest

    On Jan 16, 7:06 pm, Ben Finney wrote:
    > Carl Banks writes:
    > > On Jan 16, 3:56 pm, Ben Finney wrote:
    > > > gizli writes:
    > > > > >>> test_dict = {u'öğe':1}
    > > > > >>> u'öğe' in test_dict.keys()
    > > > > True
    > > > > >>> 'öğe' in test_dict.keys()
    > > > > True
    > > >
    > > > I would call this a bug. The two objects are different, so the latter
    > > > expression should return ‘False’.
    > >
    > > Except the two objects are not different if default encoding is utf-8.
    >
    > They are different, because a Unicode object is *not* encoded in any
    > character encoding, whereas the byte string object is.


    Of course they're different, but that's not relevant to this
    situation. What matters is whether they compare equal, which is the
    only criterion for whether an object is found in a list. x in s is
    true if there is some object m in s for which m == x.
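
    That membership rule can be written out as a small function. This is
    a simplified model with a made-up name (contains), not the actual list
    implementation, which is written in C.

```python
def contains(sequence, x):
    """Simplified model of `x in sequence` for lists: a linear scan using ==."""
    for m in sequence:
        # CPython also treats an identical object as a match before trying ==.
        if m is x or m == x:
            return True
    return False


keys = [1, 2.0, "three"]

print(contains(keys, 2))        # True: 2 == 2.0 even though the types differ
print(contains(keys, "3"))      # False
```

    No hash is involved anywhere in the scan, which is why a list can find
    an object that a dict with the same contents cannot.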

    If the default encoding and the terminal encoding are both UTF-8 (or
    both latin-9), then u'öğe' == 'öğe'. This behavior is documented (PEP
    100) and therefore not a bug. Relevant lines:

    "Unicode objects should compare equal to other objects after these
    other objects have been coerced to Unicode. For strings this means
    that they are interpreted as Unicode string using the <default
    encoding>."



    Carl Banks
     
    Carl Banks, Jan 17, 2010
    #6
  7. gizli

    gizli Guest

    Thanks to all of you. This once again proves how deep you can get
    yourself into a mess if you mix unicode and string objects in your
    code!
     
    gizli, Jan 17, 2010
    #7
  8. > This ought to be a bug; objects that compare equal and are hashable
    > must have the same hash code.


    It's not a bug. Changing the default encoding is not really supported,
    let alone changing it to anything but latin-1, precisely for the reasons
    you discuss.

    If you do change the default encoding, Python *will* break. This has
    been discussed many times, but some people still think they know better.

    Regards,
    Martin
     
    Martin v. Loewis, Jan 17, 2010
    #8
  9. > Thanks to all of you. This once again proves how deep you can get
    > yourself into a mess if you mix unicode and string objects in your
    > code!


    The specific issue is that you apparently changed the default encoding.
    Don't do that; Python will break if you do.

    Regards,
    Martin
     
    Martin v. Loewis, Jan 17, 2010
    #9
  10. On Sun, 17 Jan 2010 10:49:44 +0100, Martin v. Loewis wrote:

    >> This ought to be a bug; objects that compare equal and are hashable
    >> must have the same hash code.

    >
    > It's not a bug. Changing the default encoding is not really supported,
    > let alone changing it to anything but latin-1, precisely for the reasons
    > you discuss.
    >
    > If you do change the default encoding, Python *will* break. This has
    > been discussed many times, but some people still think they know better.



    That's specific to CPython though, isn't it? Other implementations may,
    or may not, cope with it better?





    --
    Steven
     
    Steven D'Aprano, Jan 17, 2010
    #10
  11. >>> This ought to be a bug; objects that compare equal and are hashable
    >>> must have the same hash code.

    >> It's not a bug. Changing the default encoding is not really supported,
    >> let alone changing it to anything but latin-1, precisely for the reasons
    >> you discuss.
    >>
    >> If you do change the default encoding, Python *will* break. This has
    >> been discussed many times, but some people still think they know better.

    >
    >
    > That's specific to CPython though, isn't it? Other implementations may,
    > or may not, cope with it better?


    No, that's fairly inherent to the problem. The problems might go away
    only if that other implementation doesn't use hashing for
    dictionaries. However, this is fairly unlikely, in particular since
    the language spec nearly mandates that dictionaries are hash-based
    (rather than relying on comparability).

    Regards,
    Martin
     
    Martin v. Loewis, Jan 17, 2010
    #11
