Error handling in SAX

mrkafk

(this is a repost, for it's been a while since I posted this text via
Google Groups and it plain didn't appear on c.l.py - if it did appear
anyway, apols)

So I set out to learn how to handle three-letter-acronym files in Python,
and SAX worked nicely until I encountered badly formed XML, e.g. with
bad characters in it (well, Unicode is supposed to handle it all but
apparently doesn't). I'm using http://dchublist.com/hublist.xml.bz2 as
example data, with the goal of extracting the Users and Address attributes
of hubs where the number of Users is greater than a given number.

So I extended my First XML Example with an error handler:

# ========= snip ===========
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
from xml.sax.handler import ErrorHandler

class HubHandler(ContentHandler):
    def __init__(self, hublist):
        self.Address = ''
        self.Users = ''
        # keep a reference to the caller's list on the instance
        self.hl = hublist
    def startElement(self, name, attrs):
        self.Address = attrs.get('Address', "")
        self.Users = attrs.get('Users', "")
    def endElement(self, name):
        if name == "Hub" and int(self.Users) > 2000:
            #print self.Address, self.Users
            self.hl.append({self.Address: int(self.Users)})

class HubErrorHandler(ErrorHandler):
    def __init__(self):
        pass
    def error(self, exception):
        print "Error, exception: %s\n" % exception
    def fatalError(self, exception):
        # only prints and returns; the exception is not re-raised
        print "Fatal Error, exception: %s\n" % exception

hl = []

parser = make_parser()

hHandler = HubHandler(hl)
errHandler = HubErrorHandler()

parser.setContentHandler(hHandler)
parser.setErrorHandler(errHandler)

fh = file('hublist.xml')
parser.parse(fh)

# order the one-entry dicts by their user count
def compare(x, y):
    if x.values()[0] > y.values()[0]:
        return 1
    elif x.values()[0] < y.values()[0]:
        return -1
    return 0

hl.sort(cmp=compare, reverse=True)

for h in hl:
    print h.keys()[0], " ", h.values()[0]
# ========= snip ===========

And then BAM, Pythonwin has hit me:

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)


Just before the "RESTART" line, Windows announced that it had killed the
pythonw.exe process (I suppose it was a child process).

WTF is happening here? Wasn't the fatalError method in HubErrorHandler
supposed to handle the invalid tokens? And why is the message repeated
many times? My method is apparently called, but something in SAX goes
awry and the interpreter crashes.
 

Stefan Behnel

> (this is a repost, for it's been a while since I posted this text via
> Google Groups and it plain didn't appear on c.l.py - if it did appear
> anyway, apols)

It did, although some people have added google groups to their kill file.

> So I set out to learn how to handle three-letter-acronym files in Python,
> and SAX worked nicely until I encountered badly formed XML, e.g. with
> bad characters in it (well, Unicode is supposed to handle it all but
> apparently doesn't),

If it's not well-formed, it's not XML. XML parsers are required to reject non
well-formed input.
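
To make the "required to reject" point concrete, here is a minimal sketch
(written for Python 3, unlike the Python 2 code in the question) of an error
handler that reports a fatal error once and then re-raises it, so the parse
stops. In the standard library the default ErrorHandler.fatalError raises the
exception it is given; a handler that only prints and returns leaves the
reader free to keep feeding data, which is one plausible explanation for the
repeated message. The names StrictErrorHandler and try_parse are made up for
this illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class StrictErrorHandler(ErrorHandler):
    """Record problems; abort the parse on the first fatal error."""

    def __init__(self):
        self.messages = []

    def error(self, exception):
        # Recoverable errors: record them and let parsing continue.
        self.messages.append("Error: %s" % exception)

    def fatalError(self, exception):
        # Well-formedness errors are not recoverable: record, then re-raise
        # so that the parser stops instead of being fed more data.
        self.messages.append("Fatal error: %s" % exception)
        raise exception


def try_parse(data):
    """Parse a bytes document; return (ok, messages)."""
    handler = StrictErrorHandler()
    try:
        xml.sax.parseString(data, ContentHandler(), handler)
    except xml.sax.SAXParseException:
        return False, handler.messages
    return True, handler.messages


if __name__ == "__main__":
    # Hypothetical input: the Hub element is never closed.
    ok, messages = try_parse(b"<Hublist><Hub Users='1'></Hublist>")
    print(ok)
    for message in messages:
        print(message)
```

With the exception re-raised, the caller decides what to do with a broken
file (skip it, log it, try another parser) rather than the handler silently
carrying on.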

In case it actually is well-formed XML and the problem is somewhere in your
code but you can't see it through the SAX haze, try lxml. It also allows you
to pass the expected encoding to the parser to override broken document encodings.

http://codespeak.net/lxml/
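
A rough sketch of the encoding override, assuming lxml is installed and
using Python 3. The sample document, the host name in it and the helper
names parse_with_encoding and busy_hubs are invented for illustration; the
element and attribute names (Hub, Address, Users) follow the question. The
sample bytes are Latin-1 although the declaration claims UTF-8, which is the
kind of broken encoding the override is meant for.

```python
from lxml import etree

# Hypothetical sample: Latin-1 bytes behind a declaration that says UTF-8.
BROKEN = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<Hublist><Hub Address="caf\u00e9.example" Users="2500"/></Hublist>'
).encode("iso-8859-1")


def parse_with_encoding(data, encoding):
    """Parse bytes, telling the parser which encoding to expect."""
    parser = etree.XMLParser(encoding=encoding)
    return etree.fromstring(data, parser)


def busy_hubs(root, min_users):
    """Return (Address, Users) pairs for hubs with more than min_users."""
    result = []
    for hub in root.iter("Hub"):
        users = int(hub.get("Users", "0"))
        if users > min_users:
            result.append((hub.get("Address"), users))
    # Largest hubs first, as in the original script.
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result


if __name__ == "__main__":
    root = parse_with_encoding(BROKEN, "iso-8859-1")
    for address, users in busy_hubs(root, 2000):
        print(address, users)
```

This only helps when the file is otherwise well-formed and merely mislabeled;
it does not turn arbitrary broken markup into XML.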

Stefan
 
