ElementTree XML parsing problem

Mike · Apr 27, 2011

I'm using ElementTree to parse an XML file, but it stops at the second
record (id = 002), which contains a non-standard ascii character, ä.
Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid
token): line 5, column 40

and if I change the line to eliminate the ä, everything is wonderful.
The parser is perfectly happy with this modification:

<record id="002" education="University Bremen" employment="3 yrs" />

I can't find anything in the ElementTree docs about allowing additional
text characters or coercing strange ascii to Unicode.

Is there a way to coerce the text so it doesn't cause the parser to
raise an exception?

Here's my test script (simple_fail contains the offending line, and
simple_pass contains the line that passes).

import sys
import xml.etree.ElementTree as ET

def main():

    xml_files = ['simple_fail.xml', 'simple_pass.xml']
    for xml_file in xml_files:

        print
        print 'XML file: %s' % (xml_file)

        try:
            tree = ET.parse(xml_file)
        except Exception, inst:
            print "Unexpected error opening %s: %s" % (xml_file, inst)
            continue

        root = tree.getroot()
        records = root.find('records')
        for record in records:
            print record.attrib['id'], record.attrib['education']

if __name__ == "__main__":
    main()

Thanks,

-- Mike --

Benjamin Kaplan · Apr 27, 2011

> I can't find anything in the ElementTree docs about allowing additional text
> characters or coercing strange ascii to Unicode.
>
> Is there a way to coerce the text so it doesn't cause the parser to raise an
> exception?

Have you tried specifying the file encoding? ä is not "strange ascii".
It's outside the ASCII range so if the parser expects ASCII, it will
get confused.
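
To make the point about the ASCII range concrete, here is a minimal Python 3 sketch (not from the original exchange; the names `char` and `is_ascii` are made up for illustration). It shows that ä has a code point above 127 and is stored as different bytes in ISO-8859-1 and in UTF-8:

```python
# Illustrative sketch (Python 3); all names here are hypothetical.
char = "\u00e4"  # the letter a-umlaut from "Universitaet"

code_point = ord(char)                     # 228, above the ASCII limit of 127
latin1_bytes = char.encode("iso-8859-1")   # a single byte, 0xE4
utf8_bytes = char.encode("utf-8")          # two bytes, 0xC3 0xA4

def is_ascii(text):
    """Return True if every character fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(code_point, latin1_bytes, utf8_bytes, is_ascii(char))
```

Because the same character maps to different byte sequences, a parser has to know which encoding the file actually uses before it can read the byte(s) back as ä.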

Neil Cerutti · Apr 27, 2011

It seems to be an invalid XML document, as another poster
indicated.

If you're not the one generating that bogus file, then you can
specify the encoding yourself instead by declaring an XMLParser.

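import xml.etree.ElementTree as etree

# Open in binary mode so the parser, not Python, decodes the bytes.
with open('file.xml', 'rb') as xml_file:
    # The encoding given here overrides whatever the file declares.
    parser = etree.XMLParser(encoding='ISO-8859-1')
    root = etree.parse(xml_file, parser=parser).getroot()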
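
As a self-contained illustration of the same idea, the Python 3 sketch below parses an in-memory copy of the problem record instead of a file on disk. It assumes the file really is ISO-8859-1 with no encoding declaration, which the thread suggests but does not confirm; `LATIN1_XML` and `parse_records` are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the record from the thread, stored as ISO-8859-1
# with no encoding declaration, so the parser falls back to UTF-8.
LATIN1_XML = (
    '<?xml version="1.0"?>\n'
    '<records>\n'
    '<record id="002" education="Universit\u00e4t Bremen" />\n'
    '</records>\n'
).encode("iso-8859-1")

def parse_records(raw, encoding=None):
    """Parse raw XML bytes, optionally overriding the encoding."""
    parser = ET.XMLParser(encoding=encoding)
    return ET.parse(io.BytesIO(raw), parser=parser).getroot()

# Without an override the lone 0xE4 byte is not valid UTF-8.
try:
    parse_records(LATIN1_XML)
    default_ok = True
except ET.ParseError as err:
    default_ok = False
    print("default parse failed:", err)

# With the override the same bytes parse cleanly.
root = parse_records(LATIN1_XML, encoding="ISO-8859-1")
education = root.find("record").attrib["education"]
print(ascii(education))  # ascii() keeps the output safe on any terminal
```

The override only helps if the guess is right: passing the wrong encoding can still parse without error and silently produce wrong characters.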

Philip Semanchuk · Apr 27, 2011

You've gotten a number of good observations & suggestions already. I would add that if you're saving your XML file from a text editor, make sure you're saving it as UTF-8 and not ISO-8859-1 or Win-1252.
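
If re-saving from an editor is not an option, the same conversion can be sketched in Python 3. This is only an illustration: `to_utf8` is a made-up helper, and the ISO-8859-1 fallback is an assumption about the source file, not something the thread establishes.

```python
import xml.etree.ElementTree as ET

def to_utf8(raw, fallback="iso-8859-1"):
    """Return raw bytes re-encoded as UTF-8.

    Bytes that already decode as UTF-8 are returned unchanged;
    anything else is ASSUMED to be in the fallback encoding.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")

# Hypothetical sample stored as ISO-8859-1 with no encoding declaration.
latin1_doc = (
    '<?xml version="1.0"?>\n'
    '<record id="002" education="Universit\u00e4t Bremen" />'
).encode("iso-8859-1")

converted = to_utf8(latin1_doc)
education = ET.fromstring(converted).attrib["education"]
print(ascii(education))
```

If the file carries an explicit encoding declaration, that declaration would also need to be updated to match the new bytes; the sketch does not handle that case.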

bye
Philip

Hegedüs Ervin · Apr 27, 2011

hello,

I've checked this XML with your script, and I think your locale
settings are not right.

$ ./parse.py

XML file: test.xml
001 High School
002 Universität Bremen
003 River College

(name of xml file is "test.xml")

So I started changing the encoding declaration of the XML:

<?xml version="1.0" encoding="UTF-8" ?> - same result
<?xml version="1.0" encoding="ISO-8859-2" ?> - same result
<?xml version="1.0" encoding="ISO-8859-1" ?> - same result

and then:
<?xml version="1.0" encoding="ascii" ?> - gives the same error you
described.

Try changing the XML encoding.
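
The ASCII part of this experiment can be reproduced with a short Python 3 sketch. It is an illustration under assumptions: the bytes are UTF-8, the declaration uses `US-ASCII` (the registered name for what is written as "ascii" above), and `BODY` and `parse_utf8_bytes` are hypothetical names.

```python
import xml.etree.ElementTree as ET

BODY = '<record id="002" education="Universit\u00e4t Bremen" />'

def parse_utf8_bytes(declared):
    """Parse UTF-8 encoded bytes whose XML declaration names `declared`."""
    doc = '<?xml version="1.0" encoding="%s"?>\n%s' % (declared, BODY)
    return ET.fromstring(doc.encode("utf-8")).attrib["education"]

# Declaration and bytes agree: the attribute comes back intact.
utf8_value = parse_utf8_bytes("UTF-8")

# Declaration promises ASCII, but the bytes contain values above 127.
try:
    parse_utf8_bytes("US-ASCII")
    ascii_error = None
except ET.ParseError as err:
    ascii_error = str(err)

print(ascii(utf8_value))
print(ascii_error)
```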

a.

Mike · Apr 27, 2011

Thanks, Hegedüs and everyone else who responded. That is exactly it -
I'm afraid I probably missed it in the docs because I was searching for
terms like "unicode" and "coerce." In any event, that solves the
problem. Thanks!

-- Mike --

Mike · Apr 27, 2011

Thanks, Neil. I'm not generating the file, just trying to parse it. Your
solution is precisely what I was looking for, even if I didn't quite ask
correctly. I appreciate the help!

-- Mike --

Stefan Behnel · Apr 28, 2011

Hegedüs Ervin, 27.04.2011 21:33:

> So I started changing the encoding declaration of the XML:
>
> <?xml version="1.0" encoding="UTF-8" ?> - same result
> <?xml version="1.0" encoding="ISO-8859-2" ?> - same result
> <?xml version="1.0" encoding="ISO-8859-1" ?> - same result

You probably changed this in an editor that supports XML and thus saves the
file in the declared encoding. Switching between the three by simply
changing the first line (the XML declaration) and not adapting the encoding
of the document itself would otherwise not yield the same result for the
document given above.
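
The distinction drawn here, between changing only the declaration and also re-encoding the bytes, can be sketched in Python 3. This is an illustration, not a reconstruction of anyone's actual setup; `TEMPLATE`, `education`, and `ENCODINGS` are hypothetical names.

```python
import xml.etree.ElementTree as ET

TEMPLATE = ('<?xml version="1.0" encoding="%s"?>\n'
            '<record education="Universit\u00e4t Bremen" />')

def education(declared, actual):
    """Parse a document declared as `declared` but stored as `actual`."""
    raw = (TEMPLATE % declared).encode(actual)
    return ET.fromstring(raw).attrib["education"]

ENCODINGS = ["UTF-8", "ISO-8859-1", "ISO-8859-2"]

# Case 1: the bytes are re-encoded to match the declaration,
# as an XML-aware editor would do when saving.
matched = [education(enc, enc) for enc in ENCODINGS]

# Case 2: only the declaration changes; the bytes stay UTF-8.
mismatched = [education(enc, "UTF-8") for enc in ENCODINGS]

for enc, good, bad in zip(ENCODINGS, matched, mismatched):
    print(enc, ascii(good), ascii(bad))
```

In the first case all three declarations yield the same attribute value. In the second, the two ISO-8859 declarations still parse, but the two UTF-8 bytes of ä are read as two separate characters.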

Stefan

Ervin Hegedüs · Apr 28, 2011

hello,

> You probably changed this in an editor that supports XML and thus
> saves the file in the declared encoding.

No. I saved the XML as UTF-8 and didn't change the _file_
encoding; I just modified the XML header, nothing else.

(I'm using Geany. It doesn't interpret what the user wrote in the
file; it only saves the file in another encoding when the user
chooses one.)

> Switching between the three
> by simply changing the first line (the XML declaration) and not
> adapting the encoding of the document itself would otherwise not
> yield the same result for the document given above.

Yes, that's exactly what I wrote.

a.
