ElementTree XML parsing problem

Mike

I'm using ElementTree to parse an XML file, but it stops at the second
record (id = 002), which contains a non-standard ascii character, ä.
Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid
token): line 5, column 40

and if I change the line to eliminate the ä, everything is wonderful.
The parser is perfectly happy with this modification:

<record id="002" education="University Bremen" employment="3 yrs" />

I can't find anything in the ElementTree docs about allowing additional
text characters or coercing strange ascii to Unicode.

Is there a way to coerce the text so it doesn't cause the parser to
raise an exception?

Here's my test script (simple_fail contains the offending line, and
simple_pass contains the line that passes).

import sys
import xml.etree.ElementTree as ET

def main():

    xml_files = ['simple_fail.xml', 'simple_pass.xml']
    for xml_file in xml_files:

        print
        print 'XML file: %s' % (xml_file)

        try:
            tree = ET.parse(xml_file)
        except Exception, inst:
            print "Unexpected error opening %s: %s" % (xml_file, inst)
            continue

        root = tree.getroot()
        records = root.find('records')
        for record in records:
            print record.attrib['id'], record.attrib['education']

if __name__ == "__main__":
    main()


Thanks,

-- Mike --
 
Benjamin Kaplan

I'm using ElementTree to parse an XML file, but it stops at the second
record (id = 002), which contains a non-standard ascii character, ä. Here's
the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid token):
line 5, column 40

and if I change the line to eliminate the ä, everything is wonderful. The
parser is perfectly happy with this modification:

<record id="002" education="University Bremen" employment="3 yrs" />

I can't find anything in the ElementTree docs about allowing additional text
characters or coercing strange ascii to Unicode.

Is there a way to coerce the text so it doesn't cause the parser to raise an
exception?

Have you tried specifying the file encoding? ä is not "strange ascii".
It's outside the ASCII range so if the parser expects ASCII, it will
get confused.
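
To make the point about ä lying outside ASCII concrete, here is a minimal sketch, written for Python 3 (the script in the question is Python 2). It assumes the problem file was saved as ISO-8859-1, which the thread does not confirm; the one-line document is a cut-down stand-in for the XML in the question.

```python
import xml.etree.ElementTree as ET

# The same character is a different byte sequence in each encoding.
latin1_bytes = "ä".encode("iso-8859-1")   # a single byte, 0xE4
utf8_bytes = "ä".encode("utf-8")          # two bytes, 0xC3 0xA4

# Cut-down stand-in for the failing record (assumption: Latin-1 on disk).
doc = '<record id="002" education="Universität Bremen" />'

# With no encoding declaration the parser assumes UTF-8, so the lone
# Latin-1 byte is rejected as an invalid token.
try:
    ET.fromstring(doc.encode("iso-8859-1"))
    latin1_parsed = True
except ET.ParseError as err:
    latin1_parsed = False
    print("Latin-1 bytes:", err)

# The UTF-8 encoding of the very same text parses cleanly.
record = ET.fromstring(doc.encode("utf-8"))
print(record.attrib["education"])
```

So the parser is not objecting to the character itself, only to bytes that do not match the encoding it expects.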
 
Neil Cerutti

I'm using ElementTree to parse an XML file, but it stops at the
second record (id = 002), which contains a non-standard ascii
character, ä. Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed
(invalid token): line 5, column 40

It seems to be an invalid XML document, as another poster
indicated.

and if I change the line to eliminate the ä, everything is
wonderful. The parser is perfectly happy with this
modification:

<record id="002" education="University Bremen" employment="3 yrs" />

I can't find anything in the ElementTree docs about allowing
additional text characters or coercing strange ascii to
Unicode.

If you're not the one generating that bogus file, then you can
specify the encoding yourself instead by declaring an XMLParser.

import xml.etree.ElementTree as etree

# Open in binary mode so the parser, not open(), decodes the bytes.
with open('file.xml', 'rb') as xml_file:
    parser = etree.XMLParser(encoding='ISO-8859-1')
    root = etree.parse(xml_file, parser=parser).getroot()
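
The XMLParser approach above can be tried without touching any real file. This is only a sketch for Python 3: the in-memory `io.BytesIO` object stands in for the problem file, and the assumption that the file is ISO-8859-1 with no encoding declaration is mine, not something the thread confirms.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the problem file: text copied from the question,
# encoded as Latin-1 (an assumption) with no encoding declaration.
xml_text = (
    '<?xml version="1.0"?>\n'
    '<snapshot time="Mon Apr 25 08:47:23 PDT 2011">\n'
    '<records>\n'
    '<record id="002" education="Universität Bremen" employment="3 years" />\n'
    '</records>\n'
    '</snapshot>\n'
)
latin1_file = io.BytesIO(xml_text.encode("iso-8859-1"))

# Tell the parser what the bytes really are instead of letting it
# fall back to its default.
parser = ET.XMLParser(encoding="ISO-8859-1")
root = ET.parse(latin1_file, parser=parser).getroot()

educations = [rec.attrib["education"] for rec in root.find("records")]
print(educations)
```

The explicit `encoding` argument only helps if it names the encoding the bytes are actually in; a wrong guess will either fail or silently produce the wrong characters.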
 
Philip Semanchuk

I'm using ElementTree to parse an XML file, but it stops at the second record (id = 002), which contains a non-standard ascii character, ä. Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid token): line 5, column 40

You've gotten a number of good observations & suggestions already. I would add that if you're saving your XML file from a text editor, make sure you're saving it as UTF-8 and not ISO-8859-1 or Win-1252.


bye
Philip
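
If re-saving from an editor is not convenient, the same conversion can be scripted. The sketch below is for Python 3; `reencode_to_utf8` and the file names are hypothetical, and the source encoding is a guess that has to be checked against the real file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def reencode_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Rewrite a file as UTF-8. src_encoding is an assumption about
    how the original file was saved."""
    with open(src_path, "rb") as src:
        text = src.read().decode(src_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))

# Demo with throwaway files (hypothetical names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "latin1.xml")
dst_path = os.path.join(workdir, "utf8.xml")
with open(src_path, "wb") as f:
    f.write('<record education="Universität Bremen" />'.encode("iso-8859-1"))

reencode_to_utf8(src_path, dst_path)

# The converted file now parses with the default settings.
education = ET.parse(dst_path).getroot().attrib["education"]
print(education)
```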
 
Hegedüs Ervin

hello,
I'm using ElementTree to parse an XML file, but it stops at the
second record (id = 002), which contains a non-standard ascii
character, ä. Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

I've checked this XML with your script, and I think your locale
settings are wrong.

$ ./parse.py

XML file: test.xml
001 High School
002 Universität Bremen
003 River College

(name of xml file is "test.xml")

So I started changing the encoding in the XML declaration:

<?xml version="1.0" encoding="UTF-8" ?> - same result
<?xml version="1.0" encoding="ISO-8859-2" ?> - same result
<?xml version="1.0" encoding="ISO-8859-1" ?> - same result

and then:
<?xml version="1.0" encoding="ascii" ?> - gives the same error you
described.

Try changing the XML encoding.


a.
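
The "ascii" result reported above can be reproduced with a short Python 3 sketch. The helper name `parses` is made up for illustration, and the bytes are assumed to be UTF-8, as in that test.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes for the record that contains the umlaut.
body = '<record id="002" education="Universität Bremen" />'.encode("utf-8")

def parses(declared_encoding):
    """Return True if the UTF-8 bytes parse under the given declaration."""
    decl = '<?xml version="1.0" encoding="%s" ?>\n' % declared_encoding
    try:
        ET.fromstring(decl.encode("ascii") + body)
        return True
    except ET.ParseError:
        return False

utf8_ok = parses("UTF-8")    # declaration matches the bytes
ascii_ok = parses("ascii")   # bytes above 0x7F are not valid ASCII
print(utf8_ok, ascii_ok)
```

A declaration of "ascii" leaves the parser no way to interpret any byte above 0x7F, which is why it fails where the other declarations do not.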
 
Mike

Thanks, Hegedüs and everyone else who responded. That is exactly it -
I'm afraid I probably missed it in the docs because I was searching for
terms like "unicode" and "coerce." In any event, that solves the
problem. Thanks!

-- Mike --
 
Mike

Thanks, Neil. I'm not generating the file, just trying to parse it. Your
solution is precisely what I was looking for, even if I didn't quite ask
correctly. I appreciate the help!

-- Mike --
 
Stefan Behnel

Hegedüs Ervin, 27.04.2011 21:33:
hello,


I've checked this XML with your script, and I think your locale
settings are wrong.

$ ./parse.py

XML file: test.xml
001 High School
002 Universität Bremen
003 River College

(name of xml file is "test.xml")

So I started changing the encoding in the XML declaration:

<?xml version="1.0" encoding="UTF-8" ?> - same result
<?xml version="1.0" encoding="ISO-8859-2" ?> - same result
<?xml version="1.0" encoding="ISO-8859-1" ?> - same result

You probably changed this in an editor that supports XML and thus saves the
file in the declared encoding. Switching between the three by simply
changing the first line (the XML declaration) and not adapting the encoding
of the document itself would otherwise not yield the same result for the
document given above.

Stefan
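
A small Python 3 sketch of what happens when only the declaration changes and the bytes stay the same. It assumes the bytes on disk are UTF-8; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

record_utf8 = '<record education="Universität Bremen" />'.encode("utf-8")

# Declaration and bytes agree.
matched = b'<?xml version="1.0" encoding="UTF-8" ?>\n' + record_utf8
right = ET.fromstring(matched).attrib["education"]

# Same bytes, but the declaration now claims ISO-8859-1. Every byte is
# a valid Latin-1 character, so parsing succeeds with the wrong text.
mismatched = b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n' + record_utf8
wrong = ET.fromstring(mismatched).attrib["education"]

print(repr(right))
print(repr(wrong))
```

The second parse raises no error, but the two UTF-8 bytes of ä come back as two separate Latin-1 characters, so the results differ even though the parser accepted both documents.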
 
Ervin Hegedüs

hello,

You probably changed this in an editor that supports XML and thus
saves the file in the declared encoding.

No. I saved the XML as UTF-8 and didn't change the _file_
encoding; I just modified the XML header, nothing else.

(I'm using Geany. It doesn't interpret what the user wrote in the
file; it only saves the file in another encoding when the user
chooses one.)

Switching between the three
by simply changing the first line (the XML declaration) and not
adapting the encoding of the document itself would otherwise not
yield the same result for the document given above.

yes, that's what I wrote exactly.


a.
 
