Newbie XML SAX Parsing: How do I ignore an invalid token?

scott · Jan 5, 2007

I've got an XML feed from a vendor that is not well-formed, and having
them change it is not an option. I'm trying to figure out how to
create an error-handler that will ignore the invalid token and continue
on.

The file is large, so I'd prefer not to put it all in memory or save it
off and strip out the bad characters before I parse it.

I've included one of the problematic characters in a small XML snippet
below.

I'm new to Python, and I don't know how to accomplish this. Any help is
greatly appreciated!

-----------------------------------------------------------------

Here is my code:

from xml.sax import make_parser
from xml.sax.handler import ContentHandler
import StringIO

class ErrorHandler:
    def __init__(self, parser):
        self.parser = parser
    def warning(self, msg):
        print '*** (ErrorHandler.warning) msg:', msg
    def error(self, msg):
        print '*** (ErrorHandler.error) msg:', msg
    def fatalError(self, msg):
        print msg

class ContentHandler(ContentHandler):
    def __init__ (self):
        pass
    def startElement(self, name, attrs):
        pass
    def characters (self, ch):
        pass
    def endElement(self, name):
        pass

xmlstr = """
<cities>
<city>
<name>Tampa</name>
<description>A great city \x1e and place to live</description>
</city>
<city>
<name>Clearwater</name>
<description>Beautiful beaches</description>
</city>
</cities>
"""
parser = make_parser()
curHandler = ContentHandler()
errorHandler = ErrorHandler(parser)
parser.setContentHandler(curHandler)
parser.setErrorHandler(errorHandler)
parser.parse(StringIO.StringIO(xmlstr))

Chris Lambacher · Jan 5, 2007

What exactly is invalid about the XML fragment you provided?
It seems to parse correctly with ElementTree:

... <cities>
... <city>
... <name>Tampa</name>
... <description>A great city ^^ and place to live</description>
... </city>
... <city>
... <name>Clearwater</name>
... <description>Beautiful beaches</description>
... </city>
<cities>
<city>
<name>Tampa</name>
<description>A great city ^^ and place to live</description>
</city>
<city>
<name>Clearwater</name>
<description>Beautiful beaches</description>

Do you have invalid characters? unclosed tags? The solution to each of these
problems is different. More info will solicit better solutions.

-Chris

scott · Jan 6, 2007

My original posting has a funky line break character (it appears as an
ascii square) that blows up my program, but it may or may not show up
when you view my message.

I was afraid to use element tree, since my xml files can be very long,
and I was concerned about using memory structures to hold all the data.
It is my understanding that SAX reads the file line by line?

Is there a way to account for the invalid token in the error handler? I
don't mind parsing out the bad characters on a case-by-case basis. The
weather data I am ingesting only seems to have this line break
character that the parser doesn't like.

Thanks!

Scott

Martin v. Löwis · Jan 6, 2007

> My original posting has a funky line break character (it appears as an
> ascii square) that blows up my program, but it may or may not show up
> when you view my message.

Looking at your document, it seems that this "funky line break
character" is character \x1E, which, in latin-1, means "record
separator". It's indeed ill-formed to use it in XML.

> Is there a way to account for the invalid token in the error handler?

Not with a standard XML parser, no. The error you describe is a "fatal
error", and that's not something parsing can recover from. I recommend
that you filter this character out before passing it to the XML parser.
You can use the IncrementalParser interface to do so.
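
As a rough illustration of that suggestion, here is a minimal sketch, assuming Python 3 (the code earlier in this thread is Python 2) and an encoding that is an ASCII superset, so that deleting the byte 0x1E cannot damage a multi-byte character. The `NameCollector` handler and the `parse_filtered` helper are hypothetical names made up for this example. The parser returned by `xml.sax.make_parser()` implements `IncrementalParser`, so the file can be read and cleaned one chunk at a time instead of being held in memory:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class NameCollector(ContentHandler):
    """Hypothetical handler that gathers the text of each <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # list of text pieces while inside <name>

    def startElement(self, name, attrs):
        if name == "name":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per element.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._buffer = None


def parse_filtered(stream, handler, chunk_size=64 * 1024):
    """Feed a binary stream to SAX chunk by chunk, dropping 0x1E bytes."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Remove the record-separator byte before the parser sees it.
        parser.feed(chunk.replace(b"\x1e", b""))
    parser.close()


if __name__ == "__main__":
    sample = (b"<cities><city><name>Tampa</name>"
              b"<description>A great city \x1e and place to live</description>"
              b"</city></cities>")
    handler = NameCollector()
    parse_filtered(io.BytesIO(sample), handler)
    print(handler.names)
```

For a real feed, an open binary file object (`open(path, "rb")`) would take the place of the `io.BytesIO` sample.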

Regards,
Martin

scott · Jan 7, 2007

Thanks, I'll work with the file on the file system, then parse it with
SAX.

Is there a Pythonic way to read the file and identify any illegal XML
characters so I can strip them out? This would keep my program more
flexible - if the vendor is going to allow one illegal character in
their document, there's no way of knowing if another one will pop up
later.

Thanks!

Martin v. Löwis · Jan 7, 2007

> Is there a Pythonic way to read the file and identify any illegal XML
> characters so I can strip them out? This would keep my program more
> flexible - if the vendor is going to allow one illegal character in
> their document, there's no way of knowing if another one will pop up
> later.

Notice that you are talking about bytes here, not characters. It is
inherently difficult to determine invalid bytes - you first have to
determine the encoding, then (mentally) decode, and then find out
whether there are any invalid characters.

The invalid XML characters can be found in

http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

So invalid characters are #x0 .. #x8, #xB, #xC, #xE .. #x1F,
#xD800 .. #xDFFF, #xFFFE, #xFFFF.

If you restrict attention to only the invalid characters below
#x20 (i.e. control characters), and also restrict attention to
encodings that are strict ASCII supersets (ASCII, ISO-8859-x,
UTF-8), you can filter out the invalid characters on the byte
level. Otherwise, you have to decode, filter out on the character
level, and then encode again. Neither approach will deal with
bytes that are invalid with respect to the encoding.
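
The decode, filter, encode route might look like the following minimal sketch, assuming Python 3 and that the encoding is already known. The helper names `strip_invalid_chars` and `clean_xml_bytes` are invented for this example; the character class simply spells out the ranges listed above.

```python
import re

# Characters outside the XML 1.0 Char production:
# #x0-#x8, #xB, #xC, #xE-#x1F, #xD800-#xDFFF, #xFFFE, #xFFFF.
INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_chars(text):
    """Remove characters that XML 1.0 does not allow from decoded text."""
    return INVALID_XML_CHARS.sub("", text)


def clean_xml_bytes(data, encoding):
    """Decode, filter on the character level, and encode again.

    Bytes that are invalid in the given encoding still raise
    UnicodeDecodeError; this sketch does not try to repair them.
    """
    return strip_invalid_chars(data.decode(encoding)).encode(encoding)
```

This version decodes the whole input at once. For a large file, an incremental decoder from `codecs.getincrementaldecoder` could be used to apply the same filter chunk by chunk.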

To filter out these bytes, I recommend using str.translate.
Make an identity table for the substitution, and put the
bytes you want deleted into the delete table.
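
A minimal sketch of that idea, assuming Python 3, where the byte-level method is `bytes.translate` rather than `str.translate`. The names `IDENTITY`, `DELETE`, and `strip_control_bytes` are made up for this example, and it is only safe for the ASCII-superset encodings mentioned above:

```python
# Identity table: every byte value maps to itself.
IDENTITY = bytes(range(256))

# Delete table: control bytes below 0x20 that XML 1.0 forbids,
# that is, everything except tab (0x09), LF (0x0A) and CR (0x0D).
DELETE = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def strip_control_bytes(data):
    """Return data with the forbidden control bytes removed."""
    return data.translate(IDENTITY, DELETE)
```

In Python 3 the identity table can also be replaced by `None`, as in `data.translate(None, DELETE)`, which deletes the listed bytes without any other mapping.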

Regards,
Martin
