xml.sax feature question

Discussion in 'Python' started by christof hoeke, Oct 25, 2003.

  1. hi,
    this is my first try with sax (and some of the first utils in python
    too) so the code is not the best. but i wrote a small utility which
    finds all used element names in a bunch of xml files. the reason is simply
    to find out which elements are used, because a DTD is only partly available.

    so with an os.path.walk over all xml files in a dir, including subdirs, a
    simple sax ContentHandler stores all names in a dictionary (to
    keep any given name only once).

    the problem i have is that if the xmlfile has a doctype declaration the
    sax parser tries to load it and fails (IOError of course).
    partly because the path to the DTD is just a simple name in the same dir
    e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the parser
    does not use the path os.path.walk uses (can i somehow give the parser
    this information?). but it also could be a DTD which should be loaded
    over a network which is not available at the time.

    at the moment these files are not processed at all.

    i guess to simply set a feature of the sax parser to not try to load any
    external DTDs should work. question is which feature do i have to disable?
    p = xml.sax.make_parser()
    p.setFeature('http://xml.org/sax/features/validation', False)

    i thought turning off the validation would stop the parser from loading
    external DTDs, but it still tries to load them.
    any other suggestions?


    sorry for the rather lengthy explanation and code.
    thanks a lot!
    chris

    the complete code for a better understanding of my problem:

    import fnmatch, os.path, sys, xml.sax

    class ElementList:
        name = {}

        class Names(xml.sax.ContentHandler):
            def startElement(self, tag, attr):
                if not ElementList.name.has_key(tag):
                    ElementList.name[tag] = 1
                else:
                    ElementList.name[tag] += 1

        def process(self, file):
            try:
                #xml.sax.parse(file, ElementList.Names())
                p = xml.sax.make_parser()
                p.setContentHandler(ElementList.Names())
                p.setFeature('http://xml.org/sax/features/validation', False)
                p.parse(file)
                print '\t', file
            except (xml.sax.SAXException, IOError), e:
                print '\tNOT PROCESSED', file, e

        def printList(self):
            print
            print '#\t<ELEMENTNAME>'
            print '-\t-------------'
            keys = self.name.keys()
            keys.sort()
            for key in keys:
                print self.name[key], '\t', key

    class Lister:
        def __init__(self):
            self.el = ElementList()

        def process(self, dir):
            print
            print 'FILES'
            print '-----'
            def proc(junk, dir, files):
                for file in fnmatch.filter(files, '*.xml'):
                    self.el.process(os.path.join(dir, file))
            os.path.walk(dir, proc, None)

        def printList(self):
            self.el.printList()

    #MAIN
    if __name__ == '__main__':
        try:
            dir = sys.argv[1]
        except:
            print "usage: python lister.py startdir"
            sys.exit(0)
        l = Lister()
        l.process(dir)
        l.printList()
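
    The listing above is Python 2 (print statements, has_key, os.path.walk).
    For anyone reading along on Python 3, the same idea might look roughly
    like the sketch below. It is not the poster's code: the names NameCounter
    and count_elements are made up for illustration, os.walk stands in for
    os.path.walk, and collections.Counter replaces the hand-rolled dictionary.

```python
import fnmatch
import os
import xml.sax
from collections import Counter


class NameCounter(xml.sax.ContentHandler):
    """Counts how often each element name starts (hypothetical helper)."""

    def __init__(self, counts):
        super().__init__()
        self.counts = counts

    def startElement(self, name, attrs):
        self.counts[name] += 1


def count_elements(start_dir):
    """Walk start_dir recursively and count element names in all *.xml files."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(start_dir):
        for filename in fnmatch.filter(filenames, '*.xml'):
            path = os.path.join(dirpath, filename)
            parser = xml.sax.make_parser()
            parser.setContentHandler(NameCounter(counts))
            # Same feature as in the original listing; see the thread for
            # why this alone does not stop DTD loading.
            parser.setFeature('http://xml.org/sax/features/validation', False)
            try:
                parser.parse(path)
            except (xml.sax.SAXException, OSError) as exc:
                # Report the file and carry on, as the original does.
                print('NOT PROCESSED', path, exc)
    return counts
```

    Something like sorted(count_elements('xmltest').items()) would then give
    the name/count pairs that printList prints in the original.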
    christof hoeke, Oct 25, 2003
    #1

  2. christof hoeke writes:

    > the problem i have is that if the xmlfile has a doctype declaration
    > the sax parser tries to load it and fails (IOError of course).
    > partly because the path to the DTD is just a simple name in the same
    > dir e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the
    > parser does not use the path os.path.walk uses (can i somehow give the
    > parser this information?). but it also could be a DTD which should be
    > loaded over a network which is not available at the time.


    In XML, the SYSTEM identifier is a URI reference; in your case, it is
    a relative URL. An XML processor must interpret this relative to the
    URL of the main document. If you have the main document on a local
    disk, the relative URL will be interpreted relative to the file name.
    So you should put the DTD along with the document (in the same
    directory).
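
    To make the resolution rule concrete, here is a small sketch using
    urllib.parse.urljoin from the Python 3 standard library. The document
    URL is a made-up example path, not one taken from the thread.

```python
from urllib.parse import urljoin

# Hypothetical location of the main document.
document_url = "file:///data/xmltest/contacts.xml"

# The SYSTEM identifier exactly as written in the DOCTYPE declaration.
system_id = "contacts.dtd"

# A relative reference is resolved against the URL of the document
# that contains it, so the DTD is looked up next to contacts.xml.
resolved = urljoin(document_url, system_id)
print(resolved)
```

    With these inputs the result is file:///data/xmltest/contacts.dtd, that
    is, the DTD is expected in the same directory as the document.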

    > i guess to simply set a feature of the sax parser to not try to load
    > any external DTDs should work. question is which feature do i have to
    > disable?
    > p = xml.sax.make_parser()
    > p.setFeature('http://xml.org/sax/features/validation', False)
    >
    > i thought turning off the validation would stop the parser from loading
    > external DTDs, but it still tries to load them.


    This just turns off validation. The parser you are using is not
    validating anyway, so this has no effect. The parser still loads the
    DTD, in order to expand entity references it may encounter.

    > any other suggestions?


    You need to turn off resolution of general entities:

    p.setFeature("http://xml.org/sax/features/external-general-entities",False)
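
    A self-contained sketch of that setting in Python 3 follows. The
    NameCollector class, the element_names function and the sample document
    are invented for illustration; feature_external_ges is the constant in
    xml.sax.handler that holds the feature URL shown above. Note that since
    Python 3.7.1 the standard-library parser leaves this feature off by
    default, so the explicit call mainly matters on older versions.

```python
import io
import xml.sax
from xml.sax.handler import feature_external_ges


class NameCollector(xml.sax.ContentHandler):
    """Remembers every element name it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


# Sample document whose DTD does not exist anywhere.
DOC = (b'<!DOCTYPE contacts SYSTEM "contacts.dtd">\n'
       b'<contacts><contact/></contacts>')


def element_names(xml_bytes):
    handler = NameCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Do not fetch external entities, which includes the external DTD.
    parser.setFeature(feature_external_ges, False)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names
```

    Calling element_names(DOC) should return the two element names without
    any attempt to open contacts.dtd.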

    Alternatively, you can install an entity handler which then uses a
    different mechanism of resolving the DTD (and other external entities).
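
    One possible shape for such a handler, sketched for Python 3: a resolver
    that answers every request with an empty input source, so nothing is
    read from disk or the network. This is only one of several strategies,
    and the names EmptyDTDResolver, NameCollector and element_names are made
    up here. The feature is switched on explicitly because otherwise recent
    Python versions would never consult the resolver.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class EmptyDTDResolver(EntityResolver):
    """Hands back an empty stream for every external entity (hypothetical)."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


class NameCollector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


def element_names(xml_bytes):
    handler = NameCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Let the parser ask for external entities, but route every request
    # through the resolver instead of opening the system identifier.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(EmptyDTDResolver())
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names
```

    The trade-off is that entities declared only in the real DTD stay
    unknown to the parser, which is fine for collecting element names but
    may not be for other uses.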

    Regards,
    Martin
    Martin v. Löwis, Oct 26, 2003
    #2

  3. Martin v. Löwis wrote:

    > christof hoeke writes:
    >
    >
    >>the problem i have is that if the xmlfile has a doctype declaration
    >>the sax parser tries to load it and fails (IOError of course).
    >>partly because the path to the DTD is just a simple name in the same
    >>dir e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the
    >>parser does not use the path os.path.walk uses (can i somehow give the
    >>parser this information?). but it also could be a DTD which should be
    >>loaded over a network which is not available at the time.

    >
    >
    > In XML, the SYSTEM identifier is a URI reference; in your case, it is
    > a relative URL. An XML processor must interpret this relative to the
    > URL of the main document. If you have the main document on a local
    > disk, the relative URL will be interpreted relative to the file name.
    > So you should put the DTD along with the document (in the same
    > directory).


    this is what i did but still i get the exception for example for
    xmltest\contacts.xml "[Errno 2] No such file or directory:
    'contacts.dtd'" if xmltest contains contacts.xml with the SYSTEM
    identifier "contacts.dtd" and contacts.dtd is in the same directory.


    >>i guess to simply set a feature of the sax parser to not try to load
    >>any external DTDs should work.

    >
    > You need to turn off resolution of general entities:
    >
    > p.setFeature("http://xml.org/sax/features/external-general-entities",False)



    exactly what i was looking for, thanks a lot. still i wonder why the
    above error happens.

    > Alternatively, you can install an entity handler which then uses a
    > different mechanism of resolving the DTD (and other external entities).


    i think i'll get a copy of the sax2 book to look into that a bit more...

    thanks
    christof
    christof hoeke, Oct 26, 2003
    #3
  4. christof hoeke writes:

    > exactly what i was looking for, thanks a lot. still i wonder why the
    > above error happens.


    It appears that the standard entity resolver is

    class EntityResolver:
        def resolveEntity(self, publicId, systemId):
            return systemId

    So it just returns the system ID, instead of taking a base URL into
    account. I'm uncertain whether this is a limitation of PyXML, or SAX
    in general.
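
    If the base location does have to be supplied by hand, a resolver along
    the following lines could do it. This is a Python 3 sketch under
    assumptions: the class name BaseDirResolver and its base_dir argument
    are hypothetical, and it only handles plain relative file names, passing
    anything that looks like a URL or an absolute path through untouched.
    Whether a given Python version still needs this is not settled by the
    thread.

```python
import os
from xml.sax.handler import EntityResolver


class BaseDirResolver(EntityResolver):
    """Resolves relative system IDs against a fixed directory (hypothetical)."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def resolveEntity(self, publicId, systemId):
        # Leave URLs and absolute paths as they are.
        if "://" in systemId or os.path.isabs(systemId):
            return systemId
        # Otherwise look for the file next to the main document.
        return os.path.join(self.base_dir, systemId)
```

    In the lister script this would mean something like
    p.setEntityResolver(BaseDirResolver(os.path.dirname(file))) before
    calling p.parse(file).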

    Regards,
    Martin
    Martin v. Löwis, Oct 26, 2003
    #4