xml.dom.minidom character encoding

C. Benson Manica

I have the following simple script running on Python 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

I've specified utf-8 every place I can find that the documentation
allows me to, and yet this doesn't even come close to working without
UnicodeEncodeErrors. What on Earth do I have to do to please the
character encoding gods?
 
Peter Otten

C. Benson Manica said:
I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

I've specified utf-8 every place I can find that the documentation
allows me to, and yet this doesn't even come close to working without
UnicodeEncodeErrors. What on Earth do I have to do to please the
character encoding gods?

Verify every step as you proceed? The very first step, parseString() on the unicode string, already fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/xml/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 62: ordinal not in range(128)

It seems that parseString() doesn't like unicode -- let's try a byte string
then, i.e. parseString(str.encode("utf-8")).

No complaints -- let's have a look at the result of toxml(encoding="utf-8"):

'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xc3\xb3"/></elements>'

That's a byte string, so there is no need for codecs.open(); a plain open() is enough.
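
The steps described in this post can be sketched as follows. This is a minimal sketch that assumes Python 3 (where parseString() accepts both text and bytes) rather than the Python 2.5 used in the thread; the names `source`, `doc` and `data` are illustrative only.

```python
import xml.dom.minidom

# The document as text. The non-ASCII character is written as an escape
# so the example does not depend on the encoding of the source file.
source = '<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xf3"/></elements>'

# Encode first, then hand the parser a byte string.
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# With an encoding argument, toxml() returns an already encoded byte string.
data = doc.toxml(encoding="utf-8")

print(type(data).__name__)
print(b"\xc3\xb3" in data)  # b"\xc3\xb3" is the UTF-8 encoding of U+00F3
```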

Peter
 
C. Benson Manica

Peter Otten said:
It seems that parseString() doesn't like unicode

Yes, I noticed that, and I already tried...

Peter Otten said:
-- let's try a byte string then:

...except that it didn't work:

  File "./demo.py", line 8, in <module>
    doc=xml.dom.minidom.parseString( str.encode("utf-8") )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)
 
Peter Otten

C. Benson Manica said:
Yes, I noticed that, and I already tried...


...except that it didn't work:

  File "./demo.py", line 8, in <module>
    doc=xml.dom.minidom.parseString( str.encode("utf-8") )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

Are you sure that your script has

str = u"..."

like in your post and not just

str = "..."

?

Peter
 
C. Benson Manica

Peter Otten said:
Are you sure that your script has

str = u"..."

like in your post and not just

str = "..."

No :)

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str.encode("utf-8") )
xml=doc.toxml( encoding="utf-8")
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

fails:

  File "./demo.py", line 12, in <module>
    file.write( xml )
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

but dropping the encoding argument to doc.toxml() seems to finally
work. I'd be curious to know why the code you posted (that worked for
you) didn't for me, but at this point I'm just happy with something
functional. Thank you very kindly!
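
Why dropping the encoding argument helps: without it, toxml() returns text rather than bytes, and an encoding file object expects text. Here is a minimal sketch of that combination, assuming Python 3 (the thread uses 2.5.2), the built-in open() with an `encoding` argument in place of codecs.open(), and a temporary file standing in for foo.xml.

```python
import os
import tempfile
import xml.dom.minidom

source = '<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xf3"/></elements>'
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# No encoding argument: the result is text, not bytes.
text = doc.toxml()

# A temporary path stands in for foo.xml (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "foo.xml")

# The encoding file object performs the one and only encoding step.
with open(path, "w", encoding="utf-8") as out:
    out.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as check:
    written = check.read()
```

One side effect to be aware of: without an encoding argument, the XML declaration produced by toxml() carries no encoding attribute.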
 
Peter Otten

C. Benson Manica said:
No :)

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str.encode("utf-8") )
xml=doc.toxml( encoding="utf-8")
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

fails:

  File "./demo.py", line 12, in <module>
    file.write( xml )
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

But that's a different error (codecs.open().write()) on a different line.
What you said was failing (xml.dom.minidom.parseString()) worked.

C. Benson Manica said:
but dropping the encoding argument to doc.toxml() seems to finally
work. I'd be curious to know why the code you posted (that worked for
you) didn't for me, but at this point I'm just happy with something
functional. Thank you very kindly!

The following worked for me and should work for you, too:

$ cat tmp.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.dom.minidom

str = u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc = xml.dom.minidom.parseString(str.encode("utf-8"))

xml = doc.toxml(encoding="utf-8")

file = open("foo.xml", "w")
file.write( xml )
file.close()
$ python2.5 tmp.py
$ cat foo.xml
<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="ó"/></elements>$

Btw., str is a bad variable name because it shadows the builtin str type.
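
A small illustration of the shadowing problem, as a sketch only: the function name `shadowing_demo` is made up for this example, and the behaviour shown is that of Python 3.

```python
def shadowing_demo():
    # Rebinding the name hides the builtin type inside this scope.
    str = "<elements/>"
    try:
        str(42)  # the name now refers to a string, which is not callable
    except TypeError as exc:
        return type(exc).__name__
    return "no error"


result = shadowing_demo()
print(result)
```

The same applies to the name xml in the script: after xml = doc.toxml(...), it no longer refers to the imported package.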

Peter
 
Stefan Behnel

C. Benson Manica, 21.04.2010 19:19:
I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

You are trying to re-encode an already encoded output string here.
toxml(encoding="utf-8") returns a byte string. If you pass that into an
encoding file object (as returned by codecs.open()), which expects unicode
input, it will fail to re-encode the already encoded string. This gives a
bizarre error in Python 2.x and an understandable one in Python 3.
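
To make the failure mode concrete, here is a minimal sketch assuming Python 3. Writing the byte string to a text stream is rejected with a TypeError. The explicit decode("ascii") call only imitates the implicit decoding step that Python 2 attempted before re-encoding, which is where the 'ascii' codec message in the tracebacks comes from. The variable names are illustrative.

```python
import io
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<elements><elem attrib="\xf3"/></elements>'.encode("utf-8")
)
encoded = doc.toxml(encoding="utf-8")  # bytes, already encoded

# Python 3: a text stream refuses bytes outright.
text_stream = io.StringIO()
try:
    text_stream.write(encoded)
    py3_error = None
except TypeError as exc:
    py3_error = type(exc).__name__

# Imitation of Python 2's implicit step: decode the bytes with the
# default 'ascii' codec before encoding them again.
try:
    encoded.decode("ascii")
    py2_style_error = None
except UnicodeDecodeError as exc:
    py2_style_error = type(exc).__name__

print(py3_error, py2_style_error)
```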

So the right solution is to let toxml() do the encoding and drop the use of
codecs.open() in favour of

f = open("foo.xml", "wb")

(mind the 'b' in the file mode, which stands for 'bytes' or 'binary')
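
Put together, the suggestion might look like the following minimal sketch. It assumes Python 3 and uses a temporary file in place of foo.xml; the re-parse at the end is only there to check the round trip.

```python
import os
import tempfile
import xml.dom.minidom

source = '<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xf3"/></elements>'
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# A temporary path stands in for foo.xml (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "foo.xml")

# toxml() does the encoding; the file is opened in binary mode and
# receives the bytes unchanged.
with open(path, "wb") as f:
    f.write(doc.toxml(encoding="utf-8"))

# Parse the file again to confirm the attribute survived the round trip.
reparsed = xml.dom.minidom.parse(path)
value = reparsed.documentElement.firstChild.getAttribute("attrib")
print(value == "\xf3")
```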

Stefan
 
