XML SAX parser bug?

mitsura

Hi,

I think I ran into a bug in the XML SAX parser.

Part of my program consists of reading a rather large XML file (about
10 MB) containing a few thousand elements.
I have the following problem: sometimes the SAX parser misreads a
line.
Let me explain: the XML file contains a few thousand lines like this:
"
<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>
"
where 'n91c90a.cmc.com' is the name of a system and thus changes per
system.
In a few cases, the SAX parser misreads the line. The parser sometimes
splits the characters of the line into:
"WINOSSPI:Storage@@n" and "91c90a.cmc.com".
I put a 'print characters' line in the 'characters' method of the
parser; that is how I found out.
It only happens for a few of the thousand lines, but you can imagine
that is very annoying.

I checked for errors in the XML file but the file seems ok.

Is this a bug or am I doing something wrong?
I am new to Python.

I am using Python 2.4.1, pyWin32 extension 2.4 and PyXML 0.8.4

Any help very much appreciated.

Kris
 
Fredrik Lundh

it's not a bug; the parser is free to split up character runs (due to buffering,
entities or character references, etc). it's up to you to merge character runs
into strings.

</F>
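
To see the splitting happen, here is a minimal sketch, assuming Python 3 and the standard library's expat-based `xml.sax` reader (not the Python 2.4 / PyXML setup from the original post). It feeds the element in two chunks and records every call to `characters()` without merging. `PieceRecorder` and `parse_in_chunks` are made-up names for illustration, and the exact split points depend on the parser and its buffering, so do not rely on them.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PieceRecorder(ContentHandler):
    """Record every chunk passed to characters(), without merging."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def characters(self, data):
        # The parser may call this more than once for a single text run.
        self.pieces.append(data)


def parse_in_chunks(chunks):
    """Feed the document piece by piece, imitating a buffer boundary."""
    parser = xml.sax.make_parser()
    handler = PieceRecorder()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.pieces


pieces = parse_in_chunks([
    "<TargetRef>WINOSSPI:Storage@@n",
    "91c90a.cmc.com</TargetRef>",
])
print(pieces)           # how many pieces you get is up to the parser
print("".join(pieces))  # the merged text is always the full string
```

However the pieces come out, joining them gives back the complete text, which is the only thing the handler can count on.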
 
mitsura

Thanks for the feedback,

but how do I detect that the parser has split up the characters? I guess
I need to detect it in order to reconstruct the complete string.
 

Martin v. Löwis

Don't try to detect it. Instead, assume it always happens, and collect
the strings in characters(), rather than processing them. Do something
like this

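    from xml.sax.handler import ContentHandler

    class Handler(ContentHandler):
        def startElement(self, name, attrs):
            self.chardata = ""        # start a fresh buffer for each element

        def characters(self, data):
            self.chardata += data     # may be called several times per text run

        def endElement(self, name):
            process(self.chardata)    # process() stands for your own code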

This is simplified - you might have to deal with nested elements,
somehow.

Regards,
Martin
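
One way to deal with nested elements is to keep a stack with one buffer per open element. This is only a sketch, assuming Python 3 and the standard library `xml.sax` module. `TextCollector`, the `results` list and the `<Targets>` wrapper element are invented for the example and are not taken from the original file.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Merge character runs per element, using a stack for nesting."""

    def __init__(self):
        super().__init__()
        self._buffers = []  # one list of chunks per currently open element
        self.results = []   # (element name, merged text) pairs

    def startElement(self, name, attrs):
        self._buffers.append([])

    def characters(self, data):
        # Text always belongs to the innermost open element.
        if self._buffers:
            self._buffers[-1].append(data)

    def endElement(self, name):
        text = "".join(self._buffers.pop())
        self.results.append((name, text))


# Hypothetical document: <Targets> is an assumed wrapper element.
doc = (b"<Targets><TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com"
       b"</TargetRef></Targets>")
handler = TextCollector()
xml.sax.parseString(doc, handler)

target_refs = [text for name, text in handler.results if name == "TargetRef"]
print(target_refs)  # ['WINOSSPI:Storage@@n91c90a.cmc.com']
```

Collecting the chunks in a list and joining once in `endElement` avoids repeated string concatenation, which can matter for a file with a few thousand elements.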
 
uche.ogbuji
