R
Robin Haswell
Okay I'm getting really frustrated with Python's Unicode handling, I'm
trying everything I can think of an I can't escape Unicode(En|De)codeError
no matter what I try.
Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks
Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
Basically my app is a search engine - I'm grabbing content from pages
using HTMLParser and storing it in a database but I'm running in to these
problems all over the shop (from decoding the entities to calling
str.lower()) - I don't know what encoding my pages are coming in as, I'm
just happy enough to accept that they're either UTF-8 or latin-1 with
entities.
Any help would be great, I just hope that I have a brainwave over the
weekend because I've lost two days to Unicode errors now. It's even worse
that I've written the same app in PHP before with none of these problems -
and PHP4 doesn't even support Unicode.
Cheers
-Rob
trying everything I can think of an I can't escape Unicode(En|De)codeError
no matter what I try.
Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks
Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):import sys
sys.getdefaultencoding() 'utf-8'
import htmlentitydefs
char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
print char ©
str = u"Apple"
print str Apple
str + char
Traceback (most recent call last):
Basically my app is a search engine - I'm grabbing content from pages
using HTMLParser and storing it in a database but I'm running in to these
problems all over the shop (from decoding the entities to calling
str.lower()) - I don't know what encoding my pages are coming in as, I'm
just happy enough to accept that they're either UTF-8 or latin-1 with
entities.
Any help would be great, I just hope that I have a brainwave over the
weekend because I've lost two days to Unicode errors now. It's even worse
that I've written the same app in PHP before with none of these problems -
and PHP4 doesn't even support Unicode.
Cheers
-Rob