XML with Unicode: what am I doing wrong?

Kevin Dangoor

This is a follow-up to a blog post I wrote the other day:
http://www.blueskyonmars.com/archives/2005/01/31/using_unicode_with_elementtidy.html

I started out working in the context of elementtidy, but now I am
running into trouble in general Python-XML areas, so I thought I'd toss
the question out here. The code below is fairly self-explanatory. I have
a small HTML snippet that is UTF-8 encoded and contains non-ASCII
characters. I use Tidy to convert it to XHTML, and this particular setup
returns a unicode instance rather than a byte string.

import _elementtidy as et
from xml.parsers import expat

data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
html = et.fixup(data)[0]
parser = expat.ParserCreate()
parser.Parse(html)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in
position 542: ordinal not in range(128)

If I set my default encoding to utf8 in sitecustomize.py, it works just
fine. I'm thinking that I can't be the only one trying to pass unicode
to expat... Is there something else I need to do here?

Thanks,
Kevin
Blazing Things
 
Diez B. Roggisch

You are confusing unicode with UTF-8. Expat can parse the latter; the former is
Python's internal representation. Passing a unicode object to something that
needs a byte string triggers an implicit conversion, which fails here because
the default encoding is ASCII.

Do this:

parser.Parse(html.encode('utf-8'))
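
For anyone who wants to try this without elementtidy installed, here is a minimal sketch of the same idea. The `html` string is a made-up stand-in for what Tidy returns, and the `handle_text` callback and `collected` list are illustrative names, not part of the original code. Note that current Python 3 versions of expat also accept a text string directly, so the explicit encode step is mainly what Python 2 needs.

```python
from xml.parsers import expat

# Hypothetical stand-in for the XHTML that Tidy hands back as a unicode string.
html = u"<p>10 \u00b5m</p>"

collected = []

def handle_text(text):
    # expat may deliver character data in several chunks, so collect them all.
    collected.append(text)

# Encode explicitly to UTF-8 bytes instead of relying on the default codec.
parser = expat.ParserCreate("utf-8")
parser.CharacterDataHandler = handle_text
parser.Parse(html.encode("utf-8"), True)

text = u"".join(collected)
```

After parsing, `text` holds the original characters again, including the micro sign that the implicit ASCII conversion choked on.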
 
Just

Do this:

parser.Parse(html.encode('utf-8'))

Possibly preceded by

parser = expat.ParserCreate('utf-8')

...so there's no confusion with the declared encoding, in case that's not
utf-8.

Just
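
A small sketch of why the explicit encoding argument can matter, assuming a document whose XML declaration disagrees with the bytes you actually pass in. The `parse_text` helper and the sample `doc` are hypothetical, made up for illustration. The encoding given to `ParserCreate` takes precedence over the encoding named in the declaration.

```python
from xml.parsers import expat

def parse_text(data, encoding=None):
    """Return the character data expat reports for a byte string."""
    chunks = []
    parser = expat.ParserCreate(encoding)
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return u"".join(chunks)

# Hypothetical document: the declaration claims iso-8859-1,
# but the bytes we pass in are really UTF-8.
doc = u'<?xml version="1.0" encoding="iso-8859-1"?><p>\u00b5</p>'.encode("utf-8")

overridden = parse_text(doc, "utf-8")  # the parser is told the real encoding
declared = parse_text(doc)             # the parser trusts the declaration
```

With the override, the micro sign comes back as a single character. Without it, the two UTF-8 bytes are decoded as two separate Latin-1 characters, which is the kind of confusion the extra argument avoids.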
 
