Problem with sets and Unicode strings

Discussion in 'Python' started by Dennis Benzinger, Jun 27, 2006.

  1. Hi!

    The following program in a UTF-8 encoded file:


    # -*- coding: UTF-8 -*-

    FIELDS = ("Fächer", )
    FROZEN_FIELDS = frozenset(FIELDS)
    FIELDS_SET = set(FIELDS)

    print u"Fächer" in FROZEN_FIELDS
    print u"Fächer" in FIELDS_SET
    print u"Fächer" in FIELDS


    gives this output


    False
    False
    Traceback (most recent call last):
      File "test.py", line 9, in ?
        print u"Fächer" in FIELDS
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)


    Why do the first two print statements succeed while the third one fails
    with an exception?

    Why does the use of set/frozenset remove the exception?


    Thanks,
    Dennis
     
    Dennis Benzinger, Jun 27, 2006
    #1

  2. Serge Orlov Guest

    On 6/27/06, Dennis Benzinger wrote:
    > Why do the first two print statements succeed and the third one fails
    > with an exception?


    Actually, all three statements fail to produce a correct result.

    > Why does the use of set/frozenset remove the exception?


    Because sets use a hash algorithm to find matches, whereas the last
    statement directly compares a unicode string with a byte string. Byte
    strings can only contain ASCII characters; that's why Python raises an
    exception. The problem is very easy to fix: use unicode strings for
    all non-ASCII strings.
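
A minimal sketch of that fix, assuming Python 3, where `bytes` stands in for the thread's 8-bit string and `str` for `unicode`. The names `RAW_FIELDS` and `needle` are illustrative and not from the original program. Once the bytes are decoded with the encoding they are known to use, all three containment tests agree:

```python
# Python 3 analogue of the thread's Python 2 program (an assumption of
# this sketch): bytes plays the 8-bit string, str plays unicode.
RAW_FIELDS = (b"F\xc3\xa4cher",)  # UTF-8 bytes for "Fächer"

# Decode once, at the boundary, with the encoding the bytes really use.
FIELDS = tuple(raw.decode("utf-8") for raw in RAW_FIELDS)
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

needle = "F\u00e4cher"  # "Fächer" as text

results = [
    needle in FROZEN_FIELDS,
    needle in FIELDS_SET,
    needle in FIELDS,
]
print(results)  # [True, True, True]
```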
     
    Serge Orlov, Jun 27, 2006
    #2

  3. Serge Orlov wrote:
    > On 6/27/06, Dennis Benzinger wrote:
    >> Why do the first two print statements succeed and the third one fails
    >> with an exception?

    >
    > Actually all three statements fail to produce correct result.


    So this is a bug in Python?

    >> Why does the use of set/frozenset remove the exception?
    >
    > Because sets use hash algorithm to find matches, whereas the last
    > statement directly compares a unicode string with a byte string. Byte
    > strings can only contain ascii characters, that's why python raises an
    > exception. The problem is very easy to fix: use unicode strings for
    > all non-ascii strings.


    No, byte strings contain characters which are at least 8-bit wide
    <http://docs.python.org/ref/types.html>. But I don't understand what
    Python is trying to decode and why the exception says something about
    the ASCII codec, because my file is encoded with UTF-8.


    Dennis
     
    Dennis Benzinger, Jun 27, 2006
    #3
  4. Serge Orlov Guest

    On 6/27/06, Dennis Benzinger wrote:
    > Serge Orlov wrote:
    > > On 6/27/06, Dennis Benzinger wrote:
    > >> Why do the first two print statements succeed and the third one fails
    > >> with an exception?

    > >
    > > Actually all three statements fail to produce correct result.

    >
    > So this is a bug in Python?


    No.

    > >> Why does the use of set/frozenset remove the exception?
    > >
    > > Because sets use hash algorithm to find matches, whereas the last
    > > statement directly compares a unicode string with a byte string. Byte
    > > strings can only contain ascii characters, that's why python raises an
    > > exception. The problem is very easy to fix: use unicode strings for
    > > all non-ascii strings.

    >
    > No, byte strings contain characters which are at least 8-bit wide
    > <http://docs.python.org/ref/types.html>.


    Yes, but later it's written that non-ASCII characters do not have a
    universal meaning assigned to them. In other words, if you put byte
    0xE4 into a byte string, all Python knows about it is that it's *some*
    character. If you put character U+00E4 into a unicode string, Python
    knows it's a "latin small letter a with diaeresis". Trying to compare
    *some* character with a specific character is obviously undefined.
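
A short illustration of the "*some* character" point, assuming Python 3. The choice of `latin-1` and `cp1251` as the two candidate encodings is only an example. The same byte turns into two unrelated characters depending on which codec is applied:

```python
import unicodedata

raw = b"\xe4"  # one byte, with no encoding information attached

# The byte only becomes a specific character once a codec is chosen.
as_latin1 = raw.decode("latin-1")
as_cp1251 = raw.decode("cp1251")

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_cp1251))  # CYRILLIC SMALL LETTER DE
print(as_latin1 == as_cp1251)       # False
```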

    > But I don't understand what
    > Python is trying to decode and why the exception says something about
    > the ASCII codec, because my file is encoded with UTF-8.


    Because byte strings can come from different sources (network, files,
    etc.), not only from the source of your program, Python cannot assume
    all of them are UTF-8. It assumes they are ASCII, because most
    widespread text encodings are ASCII-based. Actually it's a guess,
    since there are UTF-16, UTF-32 and other non-ASCII encodings. If you
    want to experience life without guesses, put
    sys.setdefaultencoding("undefined") into site.py.
     
    Serge Orlov, Jun 27, 2006
    #4
  5. Robert Kern Guest

    Dennis Benzinger wrote:
    > Serge Orlov wrote:
    >> On 6/27/06, Dennis Benzinger wrote:
    >>> Why do the first two print statements succeed and the third one fails
    >>> with an exception?

    >> Actually all three statements fail to produce correct result.

    >
    > So this is a bug in Python?


    No.

    >>> Why does the use of set/frozenset remove the exception?
    >>
    >> Because sets use hash algorithm to find matches, whereas the last
    >> statement directly compares a unicode string with a byte string. Byte
    >> strings can only contain ascii characters, that's why python raises an
    >> exception. The problem is very easy to fix: use unicode strings for
    >> all non-ascii strings.

    >
    > No, byte strings contain characters which are at least 8-bit wide
    > <http://docs.python.org/ref/types.html>. But I don't understand what
    > Python is trying to decode and why the exception says something about
    > the ASCII codec, because my file is encoded with UTF-8.


    Please read

    http://www.amk.ca/python/howto/unicode

    The string in all of the containers (FIELDS, FROZEN_FIELDS, FIELDS_SET) is a
    regular byte string, not a Unicode string. The encoding declaration only
    controls how the file is parsed. The string literal that you use for FIELDS is a
    regular string literal, not a Unicode string literal, so the object it creates
    is an 8-bit byte string. The tuple containment test is attempting to compare
    your Unicode string object to the regular string object for equality. Python
    does these comparisons by attempting to decode the regular string into a Unicode
    string. Since there is no encoding information present on regular strings at
    this point (since the encoding declaration in your file only controls parsing,
    nothing else), Python assumes ASCII and throws an exception otherwise.
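
The decoding step described above can be reproduced in isolation. This is a sketch assuming Python 3, where the implicit coercion no longer exists and the ASCII decode has to be requested explicitly. The variable names are illustrative:

```python
text = "F\u00e4cher"       # "Fächer" as text
raw = text.encode("utf-8")  # the bytes a UTF-8 source file holds for the literal

# Decoding those bytes as ASCII is what the Python 2 comparison attempted.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The error names the same codec, byte and position as the traceback
# in the original post.
print(error)
```

Byte 0xc3 at position 1 is the first of the two UTF-8 bytes that encode "ä".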

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Jun 27, 2006
    #5
  6. Dennis Benzinger wrote:
    > No, byte strings contain characters which are at least 8-bit wide
    > <http://docs.python.org/ref/types.html>. But I don't understand what
    > Python is trying to decode and why the exception says something about
    > the ASCII codec, because my file is encoded with UTF-8.


    [addendum to others' replies]

    The file encoding directive is used by Python to convert u"xxx" strings
    into unicode objects using the right conversion rules when compiling the code.
    When a string is written simply as "xxx", it's an 8-bit string with NO
    encoding data associated. When these strings must be converted, they are
    assumed to use sys.getdefaultencoding() [generally ascii -
    forced ascii in python 2.5].

    So a short reply: the utf8 directive has no effect on 8-bit strings;
    use unicode strings to handle non-ascii text correctly.
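
To make the "no encoding data associated" point concrete, here is a small sketch, assuming Python 3 with `bytes` standing in for the 8-bit string. The same six-character text produces different byte sequences depending on the file encoding, and nothing in the bytes records which one was used:

```python
text = "F\u00e4cher"  # "Fächer": six characters

# What an 8-bit literal would contain in a UTF-8 file vs. a Latin-1 file.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(len(text), len(utf8_bytes), len(latin1_bytes))  # 6 7 6
print(utf8_bytes == latin1_bytes)                     # False
```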

    Bye,

    Laurent.
     
    Laurent Pointal, Jun 28, 2006
    #6
  7. Serge Orlov wrote:
    > On 6/27/06, Dennis Benzinger wrote:
    >> Serge Orlov wrote:
    >> > On 6/27/06, Dennis Benzinger wrote:
    >> >> Why do the first two print statements succeed and the third one fails
    >> >> with an exception?
    >> >
    >> > Actually all three statements fail to produce correct result.

    >>
    >> So this is a bug in Python?

    >
    > No.
    >
    >> >> Why does the use of set/frozenset remove the exception?
    >> >
    >> > Because sets use hash algorithm to find matches, whereas the last
    >> > statement directly compares a unicode string with a byte string. Byte
    >> > strings can only contain ascii characters, that's why python raises an
    >> > exception. The problem is very easy to fix: use unicode strings for
    >> > all non-ascii strings.

    >>
    >> No, byte strings contain characters which are at least 8-bit wide
    >> <http://docs.python.org/ref/types.html>.

    >
    > Yes, but later it's written that non-ascii characters do not have
    > universal meaning assigned to them. In other words if you put byte
    > 0xE4 into a bytes string all python knows about it is that it's *some*
    > character. If you put character U+00E4 into a unicode string python
    > knows it's a "latin small letter a with diaeresis". Trying to compare
    > *some* character with a specific character is obviously undefined.
    > [...]


    But <http://docs.python.org/ref/comparisons.html> says:

    Strings are compared lexicographically using the numeric equivalents
    (the result of the built-in function ord()) of their characters. Unicode
    and 8-bit strings are fully interoperable in this behavior.

    Doesn't this mean that Unicode and 8-bit strings can be compared and
    this comparison is well defined? (even if it is not meaningful)



    Thanks for your answers,
    Dennis
     
    Dennis Benzinger, Jun 28, 2006
    #7
  8. Robert Kern wrote:
    > Dennis Benzinger wrote:
    >> Serge Orlov wrote:
    >>> On 6/27/06, Dennis Benzinger wrote:
    >>>> Why do the first two print statements succeed and the third one fails
    >>>> with an exception?
    >>> Actually all three statements fail to produce correct result.

    >>
    >> So this is a bug in Python?

    >
    > No.
    > [...]


    But I'd say that it's not intuitive that for sets x in y can be false
    (without raising an exception!) while doing the same with a tuple
    raises an exception. Where is this difference documented?


    Thanks,
    Dennis
     
    Dennis Benzinger, Jun 28, 2006
    #8
  9. > But <http://docs.python.org/ref/comparisons.html> says:
    >
    > Strings are compared lexicographically using the numeric equivalents
    > (the result of the built-in function ord()) of their characters. Unicode
    > and 8-bit strings are fully interoperable in this behavior.
    >
    > Doesn't this mean that Unicode and 8-bit strings can be compared and
    > this comparison is well defined? (even if it is not meaningful)


    Obviously not - otherwise you wouldn't have the problems you observed,
    would you?

    What happens, of course, is that in a string-to-unicode comparison, the
    string gets coerced to a unicode value - using the default encoding!


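    # -*- coding: latin1 -*-

    # Explicit decoding with the right codec: the comparison is meaningful.
    print "ö".decode("latin1") == u"ö"
    # Implicit coercion uses the default encoding (ASCII), which cannot
    # decode the non-ASCII byte.
    print "ö" == u"ö"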



    So - they are fully interoperable and the comparison is well defined - when
    the coercion is successful.
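
A successful decode is not enough on its own; the codec also has to be the right one. The sketch below assumes Python 3 and uses illustrative names. Decoding UTF-8 bytes as Latin-1 raises no error, but the result is a different string, so the comparison quietly comes out false:

```python
raw = b"F\xc3\xa4cher"   # UTF-8 bytes for "Fächer"
wanted = "F\u00e4cher"   # "Fächer" as text

right = raw.decode("utf-8")    # the codec the bytes were written with
wrong = raw.decode("latin-1")  # also succeeds, but maps each byte to one character

print(right == wanted)  # True
print(wrong == wanted)  # False
print(len(right), len(wrong))  # 6 7
```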

    Diez
     
    Diez B. Roggisch, Jun 28, 2006
    #9
  10. > But I'd say that it's not intuitive that for sets x in y can be false
    > (without raising an exception!) while doing the same with a tuple
    > raises an exception. Where is this difference documented?


    2.3.7 Set Types -- set, frozenset

    ....

    Set elements are like dictionary keys; they need to define both __hash__ and
    __eq__ methods.
    ....

    And it has to hold that

    a == b => hash(a) == hash(b)

    but NOT

    hash(a) == hash(b) => a == b

    Thus if the hashes vary, the set doesn't bother to actually compare the
    values.
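
The short-circuit can be observed directly. This is a sketch for CPython 3 using a hypothetical `Probe` class (not from the thread) whose `__eq__` raises whenever it is called. With different hashes the set lookup never reaches `__eq__`, while the tuple test has no hash to consult and must compare:

```python
class Probe:
    """Hypothetical helper: fixed hash, and an __eq__ that always raises."""

    def __init__(self, hash_value):
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        raise ValueError("__eq__ was called")


stored = Probe(1)
needle = Probe(2)  # different hash from the stored element

# Set lookup: hashes differ, so no equality test is attempted.
in_set = needle in {stored}

# Tuple lookup: a linear scan that compares each element with ==.
try:
    needle in (stored,)
    tuple_raised = False
except ValueError:
    tuple_raised = True

print(in_set, tuple_raised)  # False True
```

This mirrors the original program: the set quietly answers False, and the tuple surfaces the failing comparison.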

    Diez
     
    Diez B. Roggisch, Jun 28, 2006
    #10
  11. Diez B. Roggisch wrote:
    >> But I'd say that it's not intuitive that for sets x in y can be false
    >> (without raising an exception!) while doing the same with a tuple
    >> raises an exception. Where is this difference documented?

    >
    > 2.3.7 Set Types -- set, frozenset
    >
    > ...
    >
    > Set elements are like dictionary keys; they need to define both __hash__ and
    > __eq__ methods.
    > ...
    >
    > And it has to hold that
    >
    > a == b => hash(a) == hash(b)
    >
    > but NOT
    >
    > hash(a) == hash(b) => a == b
    >
    > Thus if the hashes vary, the set doesn't bother to actually compare the
    > values.
    > [...]


    Ok, I understand.
    But isn't it a (minor) problem that using a set like this:

    # -*- coding: UTF-8 -*-

    FIELDS_SET = set(("Fächer", ))


    print u"Fächer" in FIELDS_SET
    print u"Fächer" == "Fächer"


    shadows the error of not setting sys.defaultencoding()?


    Dennis
     
    Dennis Benzinger, Jun 29, 2006
    #11
  12. Robert Kern Guest

    Dennis Benzinger wrote:
    > Ok, I understand.
    > But isn't it a (minor) problem that using a set like this:
    >
    > # -*- coding: UTF-8 -*-
    >
    > FIELDS_SET = set(("Fächer", ))
    >
    > print u"Fächer" in FIELDS_SET
    > print u"Fächer" == "Fächer"
    >
    > shadows the error of not setting sys.defaultencoding()?


    You can't set the default encoding. If you could, then scripts that run on your
    machine wouldn't run on mine.

    If there's an error, it's the fact that you use a regular string at the
    beginning ("Fächer") and a unicode string later (u"Fächer"). But set objects
    can't know that that's the problem or even if it *is* a problem.

    --
    Robert Kern

     
    Robert Kern, Jun 29, 2006
    #12
  13. Robert Kern wrote:
    > Dennis Benzinger wrote:
    >> Ok, I understand.
    >> But isn't it a (minor) problem that using a set like this:
    >>
    >> # -*- coding: UTF-8 -*-
    >>
    >> FIELDS_SET = set(("Fächer", ))
    >>
    >> print u"Fächer" in FIELDS_SET
    >> print u"Fächer" == "Fächer"
    >>
    >> shadows the error of not setting sys.defaultencoding()?

    >
    > You can't set the default encoding. If you could, then scripts that run
    > on your machine wouldn't run on mine.
    > [...]


    As Serge Orlov wrote in one of his posts you _can_ set the default
    encoding (at least in site.py). See
    <http://docs.python.org/lib/module-sys.html>


    Bye,
    Dennis
     
    Dennis Benzinger, Jun 29, 2006
    #13
  14. Robert Kern Guest

    Dennis Benzinger wrote:
    > Robert Kern wrote:
    >> Dennis Benzinger wrote:
    >>> Ok, I understand.
    >>> But isn't it a (minor) problem that using a set like this:
    >>>
    >>> # -*- coding: UTF-8 -*-
    >>>
    >>> FIELDS_SET = set(("Fächer", ))
    >>>
    >>> print u"Fächer" in FIELDS_SET
    >>> print u"Fächer" == "Fächer"
    >>>
    >>> shadows the error of not setting sys.defaultencoding()?

    >> You can't set the default encoding. If you could, then scripts that run
    >> on your machine wouldn't run on mine.
    >> [...]

    >
    > As Serge Orlov wrote in one of his posts you _can_ set the default
    > encoding (at least in site.py). See
    > <http://docs.python.org/lib/module-sys.html>


    Okay, *don't* set the default encoding to anything other than 'ascii'. Doing so
    would be an error, not the other way around.

    --
    Robert Kern

     
    Robert Kern, Jun 29, 2006
    #14
  15. Dennis Benzinger wrote:

    >>> shadows the error of not setting sys.defaultencoding()?

    >>
    >> You can't set the default encoding. If you could, then scripts that run
    >> on your machine wouldn't run on mine.
    >> [...]

    >
    > As Serge Orlov wrote in one of his posts you _can_ set the default
    > encoding (at least in site.py). See
    > <http://docs.python.org/lib/module-sys.html>


    yes, but you're not supposed to do that, for several reasons, including
    the reasons Robert provided: if you mess with the interpreter defaults,
    code you write isn't portable, and code written by others may not work
    on your machine.

    the interpreter isn't fully encoding agnostic either; things are not
    guaranteed to work properly if you're not using the default.

    </F>
     
    Fredrik Lundh, Jun 29, 2006
    #15
