UnicodeDecodeError help please?

Discussion in 'Python' started by Robin Haswell, Apr 7, 2006.

  1. Okay I'm getting really frustrated with Python's Unicode handling, I'm
    trying everything I can think of and I can't escape Unicode(En|De)codeError
    no matter what I try.

    Could someone explain to me what I'm doing wrong here, so I can hope to
    throw light on the myriad of similar problems I'm having? Thanks :)

    Python 2.4.1 (#2, May 6 2005, 11:22:24)
    [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'
    >>> import htmlentitydefs
    >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
    >>> print char
    ©
    >>> str = u"Apple"
    >>> print str
    Apple
    >>> str + char
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
    >>> a = str+char
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
    >>>


    Basically my app is a search engine - I'm grabbing content from pages
    using HTMLParser and storing it in a database but I'm running in to these
    problems all over the shop (from decoding the entities to calling
    str.lower()) - I don't know what encoding my pages are coming in as, I'm
    just happy enough to accept that they're either UTF-8 or latin-1 with
    entities.

    Any help would be great, I just hope that I have a brainwave over the
    weekend because I've lost two days to Unicode errors now. It's even worse
    that I've written the same app in PHP before with none of these problems -
    and PHP4 doesn't even support Unicode.

    Cheers

    -Rob
     
    Robin Haswell, Apr 7, 2006
    #1

  2. Robert Kern

    Robin Haswell wrote:
    > Okay I'm getting really frustrated with Python's Unicode handling, I'm
    > trying everything I can think of and I can't escape Unicode(En|De)codeError
    > no matter what I try.


    Have you read any of the documentation about Python's Unicode support? E.g.,

    http://effbot.org/zone/unicode-objects.htm

    > Could someone explain to me what I'm doing wrong here, so I can hope to
    > throw light on the myriad of similar problems I'm having? Thanks :)
    >
    > Python 2.4.1 (#2, May 6 2005, 11:22:24)
    > [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >
    > >>> import sys
    > >>> sys.getdefaultencoding()
    > 'utf-8'


    How did this happen? It's supposed to be 'ascii' and not user-settable.

    > >>> import htmlentitydefs
    > >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
    > >>> print char
    > ©
    > >>> str = u"Apple"
    > >>> print str
    > Apple
    > >>> str + char
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
    > >>> a = str+char
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte


    The values in htmlentitydefs.entitydefs are encoded in latin-1 (or are numeric
    entities which you still have to parse). So decode using the latin-1 codec.

    --
    Robert Kern


    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Apr 7, 2006
    #2

  3. Robin Haswell wrote:

    > Could someone explain to me what I'm doing wrong here, so I can hope to
    > throw light on the myriad of similar problems I'm having? Thanks :)
    >
    > Python 2.4.1 (#2, May 6 2005, 11:22:24)
    > [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import sys
    > >>> sys.getdefaultencoding()

    > 'utf-8'


    that's bad. do not hack the default encoding. it'll only make you sorry
    when you try to port your code to some other python installation, or use
    a library that relies on the factory settings being what they're supposed
    to be. do not hack the default encoding.

    back to your code:

    > >>> import htmlentitydefs
    > >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
    > >>> print char

    > ©


    that's a standard (8-bit) string:

    >>> type(char)
    <type 'str'>
    >>> ord(char)
    169
    >>> len(char)
    1

    one byte that contains the value 169. looks like ISO-8859-1 (Latin-1) to me.
    let's see what the documentation says:

    entitydefs
    A dictionary mapping XHTML 1.0 entity definitions to their replacement
    text in ISO Latin-1.

    alright, so it's an ISO Latin-1 string.

    > >>> str = u"Apple"
    > >>> print str

    > Apple


    >>> type(str)
    <type 'unicode'>
    >>> len(str)
    5

    that's a 5-character unicode string.

    > >>> str + char

    > Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
    > unexpected code byte


    you're trying to combine an 8-bit string with a Unicode string, and you've
    told Python (by hacking the site module) to treat all 8-bit strings as if they
    contain UTF-8. UTF-8 != ISO-Latin-1.
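
    To make the mismatch concrete, here is a minimal sketch. It is written for
    current Python 3 rather than the 2.4 interpreter used in this thread, so
    bytes plays the role of the 8-bit string and str the role of unicode. It
    decodes the single byte 0xA9 both ways; the variable names are
    illustrative only.

```python
# The copyright sign as a single Latin-1 byte, like the value held in entitydefs["copy"].
raw = b"\xa9"

# Decoding with the matching codec yields the code point U+00A9.
as_latin1 = raw.decode("iso-8859-1")

# Decoding the same byte as UTF-8 fails: 0xA9 is a continuation byte
# and cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# The same character takes two bytes once it is encoded as UTF-8.
as_utf8_bytes = "\xa9".encode("utf-8")

print(repr(as_latin1), utf8_failed, repr(as_utf8_bytes))
```

    The byte values differ between the two encodings even though the
    character is the same, which is why the codec has to be stated.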

    so, you can of course convert the string you got from the entitydefs dict
    to a unicode string before you combine the two strings

    >>> unicode(char, "iso-8859-1") + str
    u'\xa9Apple'

    but the htmlentitydefs module offers a better alternative:

    name2codepoint
    A dictionary that maps HTML entity names to the Unicode
    codepoints. New in version 2.3.

    which allows you to do

    >>> char = unichr(htmlentitydefs.name2codepoint["copy"])
    >>> char
    u'\xa9'
    >>> char + str
    u'\xa9Apple'

    without having to deal with things like

    >>> len(htmlentitydefs.entitydefs["copy"])
    1
    >>> len(htmlentitydefs.entitydefs["rarr"])
    7
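
    The 7-character value is a numeric reference ("&#8594;") that still has
    to be parsed. For entity handling in general, something along these lines
    may help. It is only a sketch, assuming Python 3, where htmlentitydefs is
    available as html.entities and unichr has become chr. decode_entities and
    its helpers are hypothetical names, not part of the standard library, and
    no range checking is done on numeric references.

```python
import re
from html.entities import name2codepoint

# Matches decimal references (&#169;), hex references (&#xA9;) and named entities (&copy;).
_ENTITY = re.compile(r"""&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);""")


def _replace(match):
    body = match.group(1)
    if body[:2] in ("#x", "#X"):
        return chr(int(body[2:], 16))   # hexadecimal numeric reference
    if body.startswith("#"):
        return chr(int(body[1:]))       # decimal numeric reference
    codepoint = name2codepoint.get(body)
    if codepoint is None:
        return match.group(0)           # unknown name: leave the text untouched
    return chr(codepoint)


def decode_entities(text):
    """Replace named and numeric HTML entities in an already-decoded string."""
    return _ENTITY.sub(_replace, text)


print(repr(decode_entities("&copy; Apple &rarr; &#169; &#xA9; &bogus;")))
```

    On Python 3.4 and later the standard library's html.unescape covers this
    case; the sketch only shows the mechanism.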

    > Basically my app is a search engine - I'm grabbing content from pages
    > using HTMLParser and storing it in a database but I'm running in to these
    > problems all over the shop (from decoding the entities to calling
    > str.lower()) - I don't know what encoding my pages are coming in as, I'm
    > just happy enough to accept that they're either UTF-8 or latin-1 with
    > entities.


    UTF-8 and Latin-1 are two different things, so your (international) users
    will hate you if you don't do this right.

    > It's even worse that I've written the same app in PHP before with none of
    > these problems - and PHP4 doesn't even support Unicode.


    a PHP4 application without I18N problems? I'm not sure I believe you... ;-)

    </F>
     
    Fredrik Lundh, Apr 7, 2006
    #3
  4. Paul Boddie

    Robin Haswell wrote:
    > Okay I'm getting really frustrated with Python's Unicode handling, I'm
    > trying everything I can think of and I can't escape Unicode(En|De)codeError
    > no matter what I try.


    If you follow a few relatively simple rules, the days of Unicode errors
    will be over. Let's take a look!

    > Could someone explain to me what I'm doing wrong here, so I can hope to
    > throw light on the myriad of similar problems I'm having? Thanks :)
    >
    > Python 2.4.1 (#2, May 6 2005, 11:22:24)
    > [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import sys
    > >>> sys.getdefaultencoding()

    > 'utf-8'


    Note that this only specifies the encoding assumed to be used in plain
    strings when such strings are used to create Unicode objects. For some
    applications this is sufficient, but where you may be dealing with many
    different character sets (or encodings), having a default encoding will
    not be sufficient. This has an impact below and in your wider problem.

    > >>> import htmlentitydefs
    > >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
    > >>> print char

    > ©


    It's better here to use repr(char) to see exactly what kind of object
    it is (or just give the name of the variable at the prompt). For me,
    it's a plain string, despite htmlentitydefs defining each name in
    terms of its "Unicode codepoint". Moreover, for me the plain string
    uses the "Latin-1" (or more correctly iso-8859-1) character set, and I
    imagine that you get the same result.

    > >>> str = u"Apple"
    > >>> print str

    > Apple
    > >>> str + char

    > Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte


    Here, Python attempts to make a Unicode object from char, using the
    default encoding (which is utf-8) and finds that char is a plain string
    containing non-utf-8 character values, specifically a single iso-8859-1
    character value. It consequently complains. This is quite unfortunate
    since you probably expected Python to give you the entity definition
    either as a Unicode object or a plain string of your chosen encoding.

    Having never used htmlentitydefs before, I can only imagine that it
    provides plain strings containing iso-8859-1 values in order to support
    "legacy" HTML processing (given that semi-modern HTML favours
    entities, and XHTML uses genuine character sequences in the stated
    encoding), and that getting anything other than such strings might not
    be particularly useful.

    Anyway, what you'd do here is this:

    str + unicode(char, "iso-8859-1")

    Rule #1: if you have plain strings and you want them as Unicode, you
    must somewhere state what encoding those strings are in, preferably as
    you convert them to Unicode objects. Here, we can't rely on the default
    encoding being correct and must explicitly state a different encoding.
    Generally, stating the encoding is the right thing to do, rather than
    assuming some default setting that may differ across environments.
    Somehow, my default encoding is "ascii" not "utf-8", so your code would
    fail on my system by relying on the default encoding.

    [...]

    > Basically my app is a search engine - I'm grabbing content from pages
    > using HTMLParser and storing it in a database but I'm running in to these
    > problems all over the shop (from decoding the entities to calling
    > str.lower()) - I don't know what encoding my pages are coming in as, I'm
    > just happy enough to accept that they're either UTF-8 or latin-1 with
    > entities.


    Rule #2: get your content as Unicode as soon as possible, then work
    with it in Unicode. Once you've made your content Unicode, you
    shouldn't get UnicodeDecodeError all over the place, and the only time
    you then risk a UnicodeEncodeError is when you convert your content
    back to plain strings, typically for serialisation purposes.
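
    Rule #2 can be sketched as follows. This assumes Python 3, where the
    thread's plain strings correspond to bytes and Unicode objects to str.
    normalise is a hypothetical helper, and the lowercasing stands in for
    whatever processing the application actually does.

```python
def normalise(raw_bytes, encoding):
    """Decode at the boundary, work in Unicode, encode only on the way out."""
    text = raw_bytes.decode(encoding)   # bytes -> Unicode, encoding stated explicitly
    text = text.lower().strip()         # all processing happens on Unicode text
    return text.encode("utf-8")         # Unicode -> bytes, here for storage as UTF-8


# Latin-1 input containing an accented capital and a copyright sign.
stored = normalise(b"  CAF\xc9 \xa9 Apple ", "iso-8859-1")
print(repr(stored))
```

    Because lower() runs on decoded text, the accented capital is lowercased
    correctly, which would not be guaranteed on the raw bytes.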

    Rule #3: get acquainted with what kind of encodings apply to the
    incoming data. If you are prepared to assume that the data is either
    utf-8 or iso-8859-1, first try making Unicode objects from the data
    stating that utf-8 is the encoding employed, and only if that fails
    should you consider it as iso-8859-1, since a utf-8 string can quite
    happily be interpreted (incorrectly) as a bunch of iso-8859-1
    characters but not vice versa; thus, you have a primitive means of
    validation.
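
    A minimal sketch of Rule #3, again assuming Python 3 (bytes in, str out).
    to_unicode is a hypothetical name. The fallback relies on iso-8859-1
    accepting every possible byte value, so the second decode cannot fail.

```python
def to_unicode(raw_bytes):
    """Try UTF-8 first; fall back to ISO-8859-1 only if that fails."""
    try:
        return raw_bytes.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw_bytes.decode("iso-8859-1"), "iso-8859-1"


print(to_unicode(b"caf\xc3\xa9"))   # a valid UTF-8 sequence
print(to_unicode(b"caf\xe9"))       # not valid UTF-8, so treated as Latin-1
```

    Returning the codec that was actually used makes it possible to log how
    often the fallback is taken. Latin-1 text that happens to form valid
    UTF-8 would still be misread, so this is validation of a primitive kind
    only.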

    > Any help would be great, I just hope that I have a brainwave over the
    > weekend because I've lost two days to Unicode errors now. It's even worse
    > that I've written the same app in PHP before with none of these problems -
    > and PHP4 doesn't even support Unicode.


    Perhaps that's why you never saw any such problems, but have you looked
    at the quality of your data?

    Paul
     
    Paul Boddie, Apr 7, 2006
    #4
  5. Ben C

    On 2006-04-07, Robin Haswell wrote:
    > Okay I'm getting really frustrated with Python's Unicode handling, I'm
    > trying everything I can think of and I can't escape Unicode(En|De)codeError
    > no matter what I try.
    >
    > Could someone explain to me what I'm doing wrong here, so I can hope to
    > throw light on the myriad of similar problems I'm having? Thanks :)
    >
    > Python 2.4.1 (#2, May 6 2005, 11:22:24)
    > [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import sys
    > >>> sys.getdefaultencoding()
    > 'utf-8'
    > >>> import htmlentitydefs
    > >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
    > >>> print char
    > ©
    > >>> str = u"Apple"
    > >>> print str
    > Apple
    > >>> str + char
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
    > >>> a = str+char
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte


    Try this:

    import htmlentitydefs

    char = htmlentitydefs.entitydefs["copy"]
    char = unicode(char, "Latin1")

    str = u"Apple"
    print str
    print str + char

    htmlentitydefs.entitydefs is "A dictionary mapping XHTML 1.0 entity
    definitions to their replacement text in ISO Latin-1".

    So you get "char" back as a Latin-1 string. Then we use the builtin
    function unicode to make a unicode string (which doesn't have an
    encoding, as I understand it, it's just unicode). This can be added to
    u"Apple" and printed out.

    It prints out OK on a UTF-8 terminal, but you can print it in other
    encodings using encode:

    print (str + char).encode("Latin1")

    for example.

    For your search engine you should look at server headers, metatags,
    BOMs, and guesswork, in roughly that order, to determine the encoding of
    the source document. Convert it all to unicode (using builtin function
    unicode) and use that to build your indexes etc., and write results out
    in whatever you need to write it out in (probably UTF-8).
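
    The detection order above could be sketched roughly like this. It assumes
    Python 3 and that the caller already has the raw page bytes plus the
    Content-Type header value, if any. guess_encoding, _known and the two
    regular expressions are hypothetical and deliberately crude; a real
    crawler would want a proper HTML parser for the meta tag.

```python
import codecs
import re

# Byte order marks mapped to codecs that consume the mark while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

_CHARSET = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.I)
_META = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.I)


def _known(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def guess_encoding(body, content_type=None):
    """Return (encoding, source), trying header, meta tag, BOM, then a guess."""
    if content_type:
        m = _CHARSET.search(content_type)
        if m and _known(m.group(1)):
            return m.group(1).lower(), "header"
    m = _META.search(body[:2048])           # only look near the top of the page
    if m:
        name = m.group(1).decode("ascii")
        if _known(name):
            return name.lower(), "meta"
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, "bom"
    try:                                     # guesswork: UTF-8 if it decodes cleanly
        body.decode("utf-8")
        return "utf-8", "guess"
    except UnicodeDecodeError:
        return "iso-8859-1", "guess"


print(guess_encoding(b"<html>caf\xe9</html>"))
```

    The returned name can be passed straight to bytes.decode, after which
    indexing and lowercasing can work on Unicode text only.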

    HTH.
     
    Ben C, Apr 7, 2006
    #5
