xml.dom.minidom character encoding

C. Benson Manica · Apr 21, 2010

I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

I've specified utf-8 every place I can find that the documentation
allows me to, and yet this doesn't even come close to working without
UnicodeEncodeErrors. What on Earth do I have to do to please the
character encoding gods?

Peter Otten · Apr 21, 2010

C. Benson Manica said:
I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

I've specified utf-8 every place I can find that the documentation
allows me to, and yet this doesn't even come close to working without
UnicodeEncodeErrors. What on Earth do I have to do to please the
character encoding gods?

Verify every step as you proceed?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/xml/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 62: ordinal not in range(128)

It seems that parseString() doesn't like unicode -- let's try a byte string
then, passing str.encode("utf-8") instead of str:

No complaints -- let's have a look at the result of doc.toxml(encoding="utf-8"):
'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xc3\xb3"/></elements>'

That's a byte string, so there is no need for codecs.open(); a plain
open("foo.xml", "w") followed by write() will do.
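
A quick way to confirm the byte-string claim is to look at the types involved. The sketch below assumes Python 3 rather than the 2.5.2 used in this thread (Python 3's bytes plays the role of the Python 2 byte string), and the variable names are illustrative:

```python
import xml.dom.minidom

# Same document as in the thread; \u00f3 is the character "ó".
source = u'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\u00f3"/></elements>'

# Hand the parser encoded bytes rather than text.
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# With an encoding argument, toxml() does the encoding itself and returns bytes.
serialized = doc.toxml(encoding="utf-8")

print(type(serialized).__name__)
print(b"\xc3\xb3" in serialized)
```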

Peter

C. Benson Manica · Apr 21, 2010

It seems that parseString() doesn't like unicode

Yes, I noticed that, and I already tried...

-- let's try a byte string
then:

...except that it didn't work:

  File "./demo.py", line 8, in <module>
    doc=xml.dom.minidom.parseString( str.encode("utf-8") )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

Peter Otten · Apr 21, 2010

C. Benson Manica said:
Yes, I noticed that, and I already tried...

...except that it didn't work:

  File "./demo.py", line 8, in <module>
    doc=xml.dom.minidom.parseString( str.encode("utf-8") )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

Are you sure that your script has

str = u"..."

like in your post and not just

str = "..."

?

Peter

C. Benson Manica · Apr 21, 2010

Are you sure that your script has

str = u"..."

like in your post and not just

str = "..."

No

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str.encode("utf-8") )
xml=doc.toxml( encoding="utf-8")
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

fails:

  File "./demo.py", line 12, in <module>
    file.write( xml )
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

but dropping the encoding argument to doc.toxml() seems to finally
work. I'd be curious to know why the code you posted (that worked for
you) didn't for me, but at this point I'm just happy with something
functional. Thank you very kindly!
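
One plausible explanation for why dropping the argument helps: without encoding, toxml() returns text rather than encoded bytes, so the encoding writer receives something it can actually encode. A minimal sketch of the difference, assuming Python 3 (not the 2.5.2 used above) and illustrative variable names:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    u'<elements><elem attrib="\u00f3"/></elements>'.encode("utf-8"))

as_text = doc.toxml()                    # no encoding: text, not yet encoded
as_bytes = doc.toxml(encoding="utf-8")   # encoding given: already encoded

print(type(as_text).__name__, type(as_bytes).__name__)

# The text form declares no encoding in its XML declaration,
# the bytes form does.
print("encoding=" in as_text, b"encoding=" in as_bytes)

# Encoding the text form yields the same element content as the bytes form.
print(as_text.encode("utf-8").endswith(b'<elem attrib="\xc3\xb3"/></elements>'))
```

In the text case the XML declaration carries no encoding attribute. Writing it out as UTF-8 is still fine, because UTF-8 is what XML parsers assume when no encoding is declared.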

Peter Otten · Apr 21, 2010

C. Benson Manica said:
No

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str.encode("utf-8") )
xml=doc.toxml( encoding="utf-8")
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

fails:

  File "./demo.py", line 12, in <module>
    file.write( xml )
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

But that's a different error (codecs.open().write()) on a different line.
What you said was failing (xml.dom.minidom.parseString()) worked.

but dropping the encoding argument to doc.toxml() seems to finally
work. I'd be curious to know why the code you posted (that worked for
you) didn't for me, but at this point I'm just happy with something
functional. Thank you very kindly!

The following worked for me and should work for you, too:

$ cat tmp.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.dom.minidom

str = u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc = xml.dom.minidom.parseString(str.encode("utf-8"))

xml = doc.toxml(encoding="utf-8")

file = open("foo.xml", "w")
file.write( xml )
file.close()
$ python2.5 tmp.py
$ cat foo.xml
<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="ó"/></elements>$

Btw., str is a bad variable name because it shadows the builtin str type.
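
A small illustration of the shadowing problem, written as a sketch for Python 3; the function name shadowed is made up for the example:

```python
# At module level, str is still the builtin type.
text = str(42)

def shadowed():
    # Rebinding the name hides the builtin for the rest of this scope.
    str = u"<elements/>"
    try:
        return str(42)       # now tries to call a string object
    except TypeError as exc:
        return exc

result = shadowed()
print(text)
print(type(result).__name__)
```

Renaming the variable to something like xml_source (any name that is not a builtin) avoids the problem.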

Peter

Stefan Behnel · Apr 22, 2010

C. Benson Manica, 21.04.2010 19:19:

I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":

#!/usr/bin/env python
#coding: utf-8

import xml.dom.minidom
import codecs

str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8" )
file=codecs.open( "foo.xml", "w", "utf-8" )
file.write( xml )
file.close()

You are trying to re-encode an already encoded output string here.
toxml(encoding="utf-8") returns a byte string. If you pass that into an
encoding file object (as returned by codecs.open()), which expects unicode
input, it will fail to re-encode the already encoded string. This gives a
bizarre error in Python 2.x and an understandable one in Python 3.
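
A minimal sketch of both failure modes, assuming Python 3. The built-in open() with an encoding argument is used here as a stand-in for codecs.open(), and the explicit decode("ascii") call only mimics the implicit conversion that Python 2 attempts with its default "ascii" codec; the variable names are illustrative:

```python
import os
import tempfile
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    u'<elements><elem attrib="\u00f3"/></elements>'.encode("utf-8"))
encoded = doc.toxml(encoding="utf-8")    # bytes, already UTF-8

# Python 2, mimicked: before re-encoding, the byte string is first decoded
# with the default "ascii" codec, which chokes on the 0xc3 byte.
py2_error = None
try:
    encoded.decode("ascii")
except UnicodeDecodeError as exc:
    py2_error = exc

# Python 3: a text-mode (encoding) file object refuses bytes outright.
py3_error = None
with tempfile.TemporaryDirectory() as tmp:
    try:
        with open(os.path.join(tmp, "foo.xml"), "w", encoding="utf-8") as f:
            f.write(encoded)
    except TypeError as exc:
        py3_error = exc

print(type(py2_error).__name__)
print(type(py3_error).__name__)
```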

So the right solution is to let toxml() do the encoding and drop the use of
codecs.open() in favour of

f = open("foo.xml", "wb")

(mind the 'b' in the file mode, which stands for 'bytes' or 'binary')
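
Put together, the approach looks roughly like the sketch below. It assumes Python 3, writes into a temporary directory instead of the current one, and reads the file back only to check the round trip; the variable names are illustrative:

```python
import os
import tempfile
import xml.dom.minidom

source = u'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\u00f3"/></elements>'
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# Let toxml() do the encoding ...
payload = doc.toxml(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.xml")

    # ... and write the resulting bytes unchanged, in binary mode.
    with open(path, "wb") as f:
        f.write(payload)

    # Read the file back and parse it again to check the round trip.
    with open(path, "rb") as f:
        written = f.read()

reparsed = xml.dom.minidom.parseString(written)
attrib = reparsed.documentElement.firstChild.getAttribute("attrib")
print(attrib == u"\u00f3")
```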

Stefan
