UnicodeEncodeError while reading xml file (newbie question)

nikosk · Jun 8, 2008

I just spent a whole day trying to read an xml file and I got stuck
with the following error:

Exception Type: UnicodeEncodeError
Exception Value: 'charmap' codec can't encode characters in position
164-167: character maps to <undefined>
Exception Location: C:\Python25\lib\encodings\cp1252.py in encode,
line 12

The string that could not be encoded/decoded was: H_C="ÊÉÍÁ" A_C

After some tests I can say with confidence that the error comes up
when python finds those greek characters after H_C="

The code that reads the file goes like this :

from xml.etree import ElementTree as ET

def read_xml(request):
    data = open('live.xml', 'r').read()
    data = data.decode('utf-8', 'replace')
    data = ET.XML(data)

I've tried all the combinations of str.decode str.encode I could
think of but nothing.

Can someone please help ?

John Machin · Jun 8, 2008

Perhaps, with some more information:
(1) the *full* traceback
(2) what encoding is mentioned up the front of the XML file
(3) why you think you need to have "data.decode(.....)" at all
(4) why you think your input file is encoded in utf8 [which may be
answered by (2)]
(5) why you are using 'replace' (which would cover up (for a while)
any non-utf8 characters in your file)
(6) what "those greek characters" *really* are -- after fiddling with
encodings in my browser the best I can make of that is four capital
gamma characters each followed by a garbage byte or a '?'. Do
something like:

print repr(open('yourfile.xml', 'rb').read()[before_pos:after_pos])

(7) are you expecting non-ASCII characters after H_C= ? what
characters? when you open your xml file in a browser, what do you see
there?
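
Points (2) and (6) above can be checked with a few lines of code. What follows is only a sketch, written for Python 3 (the thread itself uses Python 2.5). The helper names `declared_encoding` and `show_bytes` and the sample document are made up for illustration; in practice `raw` would come from `open('live.xml', 'rb').read()`.

```python
import re

GREEK = "\u039a\u0399\u039d\u0391"  # four Greek capital letters, used as sample data


def declared_encoding(raw: bytes):
    """Return the encoding named in a leading XML declaration, or None."""
    match = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']', raw)
    return match.group(1).decode("ascii") if match else None


def show_bytes(raw: bytes, start: int, stop: int) -> str:
    """Return an unambiguous, ASCII-only repr of a slice of the raw bytes."""
    return repr(raw[start:stop])


# Hypothetical stand-in for the real file contents.
sample = ('<?xml version="1.0" encoding="utf8"?><m H_C="%s" A_C="x"/>' % GREEK).encode("utf-8")

print(declared_encoding(sample))
pos = sample.index(b'H_C="')
print(show_bytes(sample, pos, pos + 14))
```

Reading the file in binary mode and printing the repr shows the bytes as they are on disk, before any decoding step has had a chance to replace or mangle them.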

nikosk · Jun 8, 2008

You won't believe how helpful your reply was. I was looking for a
problem that did not exist.
You wrote:
> (3) why you think you need to have "data.decode(.....)" at all
and after that:
> (7) are you expecting non-ASCII characters after H_C= ? what
> characters? when you open your xml file in a browser, what do you see
> there?

And I went back to see why I was doing this in the first place
(I couldn't remember after struggling for so many hours), and I opened
the file in Internet Explorer.
The browser wouldn't open it because it didn't like the encoding
declared in the <?xml ...?> declaration:
"System does not support the specified encoding. Error processing
resource 'http://scores24live.com/xml/live.xml'. Line 1, ..."
(IE was the only program that complained; FF and some other tools
opened it without hassle.)

Then I went back and looked for the original message that got me
struggling and it was this :
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

From then on it was easy to see that it was the xml encoding that was
wrong :
<?xml version="1.0" encoding="utf8"?>

when I switched that to :
<?xml version="1.0" encoding="utf-8"?>

everything just worked.
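
If the file cannot be edited at the source, the same correction could be applied in code before parsing. This is only a sketch, written for Python 3 rather than the Python 2.5 used in this thread. The function name `parse_xml_bytes` and the sample document are invented for illustration, and it assumes the declaration is the very first thing in the file (no byte order mark in front of it).

```python
import re
import xml.etree.ElementTree as ET

GREEK = "\u039a\u0399\u039d\u0391"  # four Greek capital letters, used as sample data

# Matches encoding="utf8" (either quote style) inside a leading XML declaration.
_UTF8_LABEL = re.compile(rb'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])utf8\2', re.IGNORECASE)


def parse_xml_bytes(raw: bytes) -> ET.Element:
    """Parse raw XML after rewriting a declared 'utf8' label to 'utf-8'."""
    fixed = _UTF8_LABEL.sub(rb'\g<1>\g<2>utf-8\g<2>', raw, count=1)
    return ET.fromstring(fixed)


# Hypothetical stand-in for the real file contents.
sample = ('<?xml version="1.0" encoding="utf8"?><m H_C="%s"/>' % GREEK).encode("utf-8")

root = parse_xml_bytes(sample)
print(ascii(root.get("H_C")))  # ascii() keeps the output safe for any console encoding
```

The raw bytes go straight to the parser with no `decode` call in between, so the parser does the decoding itself based on the (corrected) declaration.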

I can't thank you enough for opening my eyes...

PS.: The UnicodeEncodeError must have something to do with Java's
UTF-8
implementation (the xml is produced by a Dom4j on a J2EE server).
Those characters I posted in the original message should
have read "ÎšÎ™ÎÎ‘" (China in Greek) but I after I copy pasted them in
the post
it came up like this : H_C="Î“ï¿½Î“ï¿½Î“ï¿½Î“ï¿½" A_C which is weird because
this
page is UTF encoded which means that characters should be 1 or 2 bytes
long.
From the message you see that instead of 4 characters it reads 8 which
means
there were extra information in the string.

If the above is true then it might be something for python developers
to address in the language. If someone wishes to investigate further
here is the link for info on java utf and the file that caused the
UnicodeEncodeError :
http://en.wikipedia.org/wiki/UTF-8 (the java section)
http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

the xml file : http://dsigned.gr/live.xml
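
On the "4 characters read as 8" observation in the PS: each of these Greek letters takes two bytes in standard UTF-8, so four letters are eight bytes, and any program that shows those bytes through a single-byte code page displays eight characters. The sketch below (Python 3, not the Python 2.5 used above) illustrates that. It also reproduces a 'charmap' UnicodeEncodeError by encoding the Greek text to cp1252. Whether that is exactly what happened in the original program is an assumption, since the full traceback was never posted. The helper name `cp1252_failure` is made up for this example.

```python
GREEK = "\u039a\u0399\u039d\u0391"  # four Greek capital letters

raw = GREEK.encode("utf-8")  # two bytes per letter
# Reading the same bytes as cp1252 yields one character per byte;
# errors="replace" covers byte 0x9D, which cp1252 leaves undefined.
as_cp1252 = raw.decode("cp1252", errors="replace")


def cp1252_failure(text: str):
    """Return the UnicodeEncodeError raised when text cannot be encoded as cp1252, else None."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError as exc:
        return exc
    return None


print(len(GREEK), len(raw), len(as_cp1252))  # 4 8 8
print(cp1252_failure(GREEK))
```

cp1252 has no Greek letters at all, so the encode step fails on correctly decoded text no matter how the XML was produced.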
