XML file parsing with SAX

W

Willem Ligtenberg

I decided to use SAX to parse my xml file.
But the parser crashes on:
File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: NCBI_Entrezgene.dtd:8:0: error in processing external entity reference

This is caused by:
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
"NCBI_Entrezgene.dtd">

If I remove it, it parses normally.
I've created my parser like this:
import sys
from xml.sax import make_parser
from handler import EntrezGeneHandler

fopen = open("mouse2.xml", "r")
ch = EntrezGeneHandler()
saxparser = make_parser()
saxparser.setContentHandler(ch)
saxparser.parse(fopen)

And the handler is:
from xml.sax import ContentHandler

class EntrezGeneHandler(ContentHandler):
"""
A handler to deal with EntrezGene in XML
"""

def startElement(self, name, attrs):
print "Start element:", name

So it doesn't do much yet. And still it crashes...
How can I tell the parser not to look at the DOCTYPE declaration.
On a website:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/1/
it states that the SAX parsers are not validating, so this error shouldn't
even occur?

Cheers,

Willem
 
U

Uche Ogbuji

I decided to use SAX to parse my xml file.
But the parser crashes on:
File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: NCBI_Entrezgene.dtd:8:0: error in processing external entity reference

This is caused by:
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
"NCBI_Entrezgene.dtd">

If I remove it, it parses normally.
I've created my parser like this:
import sys
from xml.sax import make_parser
from handler import EntrezGeneHandler

fopen = open("mouse2.xml", "r")
ch = EntrezGeneHandler()
saxparser = make_parser()
saxparser.setContentHandler(ch)
saxparser.parse(fopen)

And the handler is:
from xml.sax import ContentHandler

class EntrezGeneHandler(ContentHandler):
"""
A handler to deal with EntrezGene in XML
"""

def startElement(self, name, attrs):
print "Start element:", name

So it doesn't do much yet. And still it crashes...
How can I tell the parser not to look at the DOCTYPE declaration.
On a website:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/1/
it states that the SAX parsers are not validating, so this error shouldn't
even occur?

Just because it's not validating doesn't mean that the parser won't try
to read the external entity.

Maybe you're looking for

"""
feature_external_ges
Value: "http://xml.org/sax/features/external-general-entities"
true: Include all external general (text) entities.
false: Do not include external general entities.
access: (parsing) read-only; (not parsing) read/write
"""

Quote from:

http://docs.python.org/lib/module-xml.sax.handler.html

But you're on pretty shaky ground in any XML 1.x toolkit using a bogus
DTDecl in this way. Why go through the hassle? Why not use a catalog,
or remove the DTDecl?


--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
XML Output with 4Suite & AMara - http://www.xml.com/pub/a/2005/04/20/py-xml.html
Use XSLT to prepare XML for import into OpenOffice Calc - http://www.ibm.com/developerworks/xml/library/x-oocalc/
Schema standardization for top-down semantic transparency - http://www-128.ibm.com/developerworks/xml/library/x-think31.html
 
W

Willem Ligtenberg

I didn't make the XML file. And I don't like messing with other peoples
data. So I just want my SAX parser to ignore it. I can't help if other
people make it hard for me to read their xml file...
 
U

Uche Ogbuji

so that will be sax.handler.feature_external_ges = "false"
Yes.

And it will work?

Honestly, I'm not sure. It should, but I've found these edge cases a
bit hard to predict in the Python built-in libs :-(
But what about using a catalog? I am very new to python and XML...

Catalogs allow you to rewrite the IDs for entities and such. So if
you had an XML file with an entity at a URL, but you were working
offline, you could use a catalog to "redirect" the entity to a copy on
your local filesystem.

Problem, now that I think of it, is that I'm not sure you can specify
an catalog in PySAX. You might instead have to override the method
entityResolver in your handler (and be sure to ). See the example in
listing 1 and and discussion here:

http://www.xml.com/pub/a/2005/03/02/pyxml.html

Good luck.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Use CSS to display XML, part 2 -
http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
XML Output with 4Suite & Amara -
http://www.xml.com/pub/a/2005/04/20/py-xml.htmlUse XSLT to prepare XML
for import into OpenOffice Calc -
http://www.ibm.com/developerworks/xml/library/x-oocalc/
Schema standardization for top-down semantic transparency -
http://www-128.ibm.com/developerworks/xml/library/x-think31.html
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top