ashmir.d
Hi,
I am trying to parse an XML file using the minidom parser.
<code>
from xml.dom import minidom
xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)
</code>
The parser is failing on this line:
<mrcb245-c>Heinrich Kèufner, Norbert Nedopil, Heinz Schèoch (Hrsg.).</mrcb245-c>
This is the error message I get:
Traceback (most recent call last):
  File "readXML.py", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21
It seems to me that it is having a problem with the 'è' character. I
have also tried the following to make sure the file is read as
UTF-8:
<code>
from xml.dom import minidom
import codecs
xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename,"r","utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)
</code>
However, this doesn't work either, and I get the following error message:
Traceback (most recent call last):
  File "readXML.py", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\codecs.py", line 618, in read
    return self.reader.read(size)
  File "C:\Python25\lib\codecs.py", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data
I'm assuming here that it is failing at the same place...
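One way to test that assumption is to look at the raw bytes directly instead of going through a decoder. The following is only a minimal sketch, written for Python 3 rather than the Python 2.5 shown in the tracebacks; the helper names `find_invalid_utf8` and `line_and_column` are made up for illustration and are not part of minidom or codecs. It reports the first byte span that is not valid UTF-8 and translates that offset into a line and byte column, which can then be compared with the position expat reports.

```python
import os


def find_invalid_utf8(data):
    """Return (start, end, context) for the first undecodable span, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Keep up to 20 bytes on either side so the offending bytes are visible.
        lo = max(exc.start - 20, 0)
        return exc.start, exc.end, data[lo:exc.end + 20]
    return None


def line_and_column(data, offset):
    """Translate a byte offset into a 1-based line and a 0-based byte column."""
    line = data.count(b"\n", 0, offset) + 1
    last_newline = data.rfind(b"\n", 0, offset)
    return line, offset - (last_newline + 1)


if __name__ == "__main__" and os.path.exists("sample.xml"):
    # Read in binary mode so no decoding happens before the check.
    with open("sample.xml", "rb") as handle:
        raw = handle.read()
    bad = find_invalid_utf8(raw)
    if bad is None:
        print("file decodes cleanly as UTF-8")
    else:
        start, end, context = bad
        print("invalid bytes at", start, "to", end, "->", repr(context))
        print("line, column:", line_and_column(raw, start))
```

If the line printed here matches line 2254 from the expat error, both failures come from the same bytes, and the `repr` output shows which byte values the file actually contains at that point.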
Can someone please point me in the right direction?
Thanks,
Ashmir