Newbie XML SAX Parsing: How do I ignore an invalid token?

scott

I've got an XML feed from a vendor that is not well-formed, and having
them change it is not an option. I'm trying to figure out how to
create an error-handler that will ignore the invalid token and continue
on.

The file is large, so I'd prefer not to put it all in memory or save it
off and strip out the bad characters before I parse it.

I've included one of the problematic characters in a small XML snippet
below.

I'm new to Python, and I don't know how to accomplish this. Any help is
greatly appreciated!

-----------------------------------------------------------------

Here is my code:

from xml.sax import make_parser
from xml.sax.handler import ContentHandler
import StringIO

class ErrorHandler:
    def __init__(self, parser):
        self.parser = parser
    def warning(self, msg):
        print '*** (ErrorHandler.warning) msg:', msg
    def error(self, msg):
        print '*** (ErrorHandler.error) msg:', msg
    def fatalError(self, msg):
        print msg

class ContentHandler(ContentHandler):
    def __init__ (self):
        pass
    def startElement(self, name, attrs):
        pass
    def characters (self, ch):
        pass
    def endElement(self, name):
        pass

xmlstr = """
<cities>
<city>
<name>Tampa</name>
<description>A great city  and place to live</description>
</city>
<city>
<name>Clearwater</name>
<description>Beautiful beaches</description>
</city>
</cities>
"""
parser = make_parser()
curHandler = ContentHandler()
errorHandler = ErrorHandler(parser)
parser.setContentHandler(curHandler)
parser.setErrorHandler(errorHandler)
parser.parse(StringIO.StringIO(xmlstr))
 
Chris Lambacher

What exactly is invalid about the XML fragment you provided?
It seems to parse correctly with ElementTree:

... <cities>
... <city>
... <name>Tampa</name>
... <description>A great city ^^ and place to live</description>
... </city>
... <city>
... <name>Clearwater</name>
... <description>Beautiful beaches</description>
... </city>
<cities>
<city>
<name>Tampa</name>
<description>A great city ^^ and place to live</description>
</city>
<city>
<name>Clearwater</name>
<description>Beautiful beaches</description>

Do you have invalid characters? unclosed tags? The solution to each of these
problems is different. More info will solicit better solutions.

-Chris
 
scott

My original posting has a funky line break character (it appears as an
ascii square) that blows up my program, but it may or may not show up
when you view my message.

I was afraid to use element tree, since my xml files can be very long,
and I was concerned about using memory structures to hold all the data.
It is my understanding that SAX reads the file line by line?

Is there a way to account for the invalid token in the error handler? I
don't mind parsing out the bad characters on a case-by-case basis. The
weather data I am ingesting only seems to have this line break
character that the parser doesn't like.

Thanks!

Scott
 
Martin v. Löwis

> My original posting has a funky line break character (it appears as an
> ascii square) that blows up my program, but it may or may not show up
> when you view my message.

Looking at your document, it seems that this "funky line break
character" is character \x1E, which, in latin-1, means "record
separator". It's indeed ill-formed to use it in XML.

> Is there a way to account for the invalid token in the error handler?

Not with a standard XML parser, no. The error you describe is a "fatal
error", and that's not something parsing can recover from. I recommend
that you filter this character out before passing it to the XML parser.
You can use the IncrementalParser interface to do so.

Regards,
Martin
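
The filtering suggestion above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. It assumes Python 3 (the code earlier in the thread is Python 2), an encoding that is an ASCII superset, and that \x1E is the only offending byte. NameCollector and parse_filtered are hypothetical names introduced for the example; feed() and close() are the IncrementalParser methods available on the parser returned by make_parser().

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

BAD_BYTE = b"\x1e"  # the record-separator byte identified in this thread


class NameCollector(ContentHandler):
    """Hypothetical handler: collects the text of every <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buf = []

    def characters(self, content):
        if self._in_name:
            self._buf.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buf))
            self._in_name = False


def parse_filtered(stream, handler, chunk_size=64 * 1024):
    """Feed a binary file-like object to a SAX parser chunk by chunk,
    dropping BAD_BYTE from each chunk before the parser sees it."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk.replace(BAD_BYTE, b""))
    parser.close()  # signals end of input to the parser


# Small in-memory sample; a real feed would be opened with open(path, "rb").
sample = (b"<cities><city><name>Tampa</name>"
          b"<description>A great city \x1e and place to live</description>"
          b"</city></cities>")
handler = NameCollector()
parse_filtered(io.BytesIO(sample), handler, chunk_size=16)
print(handler.names)
```

Because the input is read and fed in fixed-size chunks, the whole file never has to be held in memory or rewritten on disk first.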
 
scott

Thanks, I'll work with the file on the file system, then parse it with
SAX.

Is there a Pythonic way to read the file and identify any illegal XML
characters so I can strip them out? This would keep my program more
flexible: if the vendor allows one illegal character in their document,
there's no way of knowing whether another one will pop up later.

Thanks!
 
Martin v. Löwis

> Is there a Pythonic way to read the file and identify any illegal XML
> characters so I can strip them out? this would keep my program more
> flexible - if the vendor is going to allow one illegal character in
> their document, there's no way of knowing if another one will pop up
> later.

Notice that you are talking about bytes here, not characters. It is
inherently difficult to determine invalid bytes - you first have to
determine the encoding, then (mentally) decode, and then find out
whether there are any invalid characters.

The invalid XML characters can be found in

http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

So invalid characters are #x0 .. #x8, #xB, #xC, #xE .. #x1F,
#xD800 .. #xDFFF, #xFFFE, #xFFFF.

If you restrict attention to only the invalid characters below
#x20 (i.e. control characters), and also restrict attention to
encodings that are strict ASCII supersets (ASCII, ISO-8859-x,
UTF-8), you can filter out the invalid characters on the byte
level. Otherwise, you have to decode, filter out on the character
level, and then encode again. Neither approach will deal with
bytes that are invalid wrt. the encoding.

To filter out these bytes, I recommend using str.translate.
Make an identity table for the substitution, and put the
bytes you want deleted into the delete table.

Regards,
Martin
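
Both filtering levels described above can be sketched as follows. This is a hedged illustration assuming Python 3, where the byte-level call is bytes.translate and passing None as the table acts as the identity table (in Python 2 the equivalent would be str.translate with an explicit table and a deletechars argument). The names INVALID_CONTROL_BYTES, INVALID_XML_CHARS, strip_invalid_bytes and strip_invalid_chars are hypothetical, introduced only for this example.

```python
import re

# Byte level: control bytes below 0x20 that XML 1.0 forbids, i.e. all of
# them except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Only safe for encodings that are strict ASCII supersets.
INVALID_CONTROL_BYTES = bytes(
    b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)
)


def strip_invalid_bytes(data):
    """Delete forbidden control bytes; None means 'no substitution'."""
    return data.translate(None, INVALID_CONTROL_BYTES)


# Character level: the full list quoted above, applied to decoded text.
INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_chars(text):
    """Remove characters that XML 1.0 does not allow from a str."""
    return INVALID_XML_CHARS.sub("", text)


print(strip_invalid_bytes(b"A great city \x1e and place to live"))
print(strip_invalid_chars("Beautiful\x0c beaches\ufffe"))
```

The byte-level version avoids a decode and encode round trip, which matters for large files; the character-level version is the one to use when the encoding is not an ASCII superset. Neither handles bytes that are invalid in the encoding itself.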
 
