XML Validation with Python

Discussion in 'Python' started by Will Stuyvesant, Jul 2, 2003.

  1. Can you give a command-line example of how to do XML validation
    (checking against a DTD) with Python? Not with 4Suite or other
    3rd-party libraries, just the Python standard distribution. I have
    Python 2.2 but can upgrade to 2.3 beta if needed.

    I am looking for something like:

    "
    $ python validate.py myxmlfile.xml mydtd.dtd
    "

    where validate.py contains something like:

    "
    import somexmllib
    import sys

    # prints 1 if Okay :)
    print somexmllib.validate(sys.argv[1], sys.argv[2])
    "

    I am sorry if this is a FAQ or if it is in one of the xml libraries, I
    just could not figure it out!
     
    Will Stuyvesant, Jul 2, 2003
    #1

  2. Alan Kennedy Guest

    Will Stuyvesant wrote:

    > Can you give a command-line example of how to do XML validation
    > (checking against a DTD) with Python? Not with 4Suite or other
    > 3rd-party libraries, just the Python standard distribution.


    You can't do it. The base distribution doesn't include a validating
    XML parser.
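
To make that concrete, here is a minimal sketch (written for Python 3, which is an assumption; the thread itself is about Python 2.x) showing that the bundled expat-based parser reads a DTD but does not enforce it. The sample document and the variable names are made up for illustration.

```python
from xml.dom import minidom

# The internal DTD says <note> must contain exactly one <to> element.
# This document has a <from> instead: well-formed, but not valid.
DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Will</from></note>
"""

# No exception is raised: the parser checks well-formedness only.
dom = minidom.parseString(DOC)
root_children = [node.tagName
                 for node in dom.documentElement.childNodes
                 if node.nodeType == node.ELEMENT_NODE]
print(root_children)
```

The parse succeeds even though the content model is violated, which is why a validating parser has to come from somewhere else.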

    The only pure python validating parser is Lars Garshol's "xmlproc",
    which is a part of pyxml (a "third-party" optional extension). You can
    read the documentation for xmlproc here

    http://www.garshol.priv.no/download/software/xmlproc/

    and the bit about validating on the command line is here

    http://www.garshol.priv.no/download/software/xmlproc/cmdline.html

    Is there any reason why it has to be in the base distribution?

    Assuming that you have a good reason, maybe you can tell us what
    platform you're running on? There might be a platform specific
    parser/validator that you can call from python.

    HTH,

    --
    alan kennedy
    -----------------------------------------------------
    check http headers here: http://xhaus.com/headers
    email alan: http://xhaus.com/mailto/alan
     
    Alan Kennedy, Jul 3, 2003
    #2

  3. I could not find a way to write a simple command-line XML
    validation utility using only the Python Standard Library.
    I also found the xml.sax documentation unclear, with no
    good examples to look at. The XML examples in the Python
    Cookbook and in Python in a Nutshell are BAD too. Nowhere
    is the class library design motivated (for example, "why
    do you need a handler in xml.sax.parse() and why is there
    no default handler?"), and there are no simple examples of
    how to use it. I like the approach taken by Fredrik Lundh's
    Python Standard Library book MUCH more: clear examples and
    explanations. A damn shame O'Reilly does not want a new
    edition; the poor guy is now putting a free version on his
    website.
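
On the handler question, one possible reading is that the xml.sax.ContentHandler base class can itself act as a do-nothing handler, since all of its callback methods are empty. The sketch below (Python 3, an assumption; is_well_formed is a hypothetical helper name) uses that to get a 1-or-0 answer. Note that it checks well-formedness only, not validity against a DTD.

```python
import xml.sax


def is_well_formed(xml_bytes):
    """Return 1 if xml_bytes parses cleanly, 0 otherwise.

    This does NOT validate against a DTD; expat is non-validating.
    """
    try:
        # The base ContentHandler ignores every event, so it works
        # as an "accept anything" handler.
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return 0
    return 1


print(is_well_formed(b"<a><b/></a>"))  # matching tags
print(is_well_formed(b"<a><b></a>"))   # <b> is never closed
```

A real handler would subclass ContentHandler and override only the callbacks it cares about, such as startElement.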

    I have found a solution for XML validation using the
    3rd-party pyRXP library from http://www.reportlab.com/xml/pyrxp.html
    Their "download and install" info is a mess. I first
    downloaded a .ZIP containing only .DLL and .PYD files,
    and it turned out you had to plunk those into
    C:\Python22\DLL. This made me turn away from pyRXP
    initially, because bad installation usually means bad
    software. But later on I found a bigger .ZIP with more
    stuff, so maybe I should have used that one? At least it
    works now: I can do "import pyRXP". Make sure you also
    download pyRXP_Documentation.pdf. This is good
    documentation with examples. I notice the docs in the
    other big .ZIP are in .RML format...whatever that is!

    I can not believe the amount of bad documentation and
    bad install approaches I see with 3rd party software.
    That is why I normally stick to Python Standard Library
    only.

    Anyway, I can now do XML validation; below is
    "validate.py". But this does not solve my initial
    problem: if the file validates, validate.py prints
    nothing, and if there is a mistake it prints an error
    message. What I really wanted, to give more confidence
    that the validation is okay, is to print 1 or 0
    depending on the result, but I have not figured out yet
    how to do that and now I am too tired of it all...

    # file: validate.py
    import sys
    if len(sys.argv)<2 or sys.argv[1] in ['-h','--help','/?']:
        print 'Usage: validate.py xmlfilename'
        sys.exit()
    import pyRXP
    p = pyRXP.Parser()
    fn=open(sys.argv[1], 'r').read()
    p.parse(fn)
     
    Will Stuyvesant, Jul 3, 2003
    #3
  4. > [Alan Kennedy]
    > The only pure python validating parser is Lars Garshol's "xmlproc",
    > which is a part of pyxml (a "third-party" optional extension). You can
    > read the documentation for xmlproc here
    >
    > http://www.garshol.priv.no/download/software/xmlproc/
    >
    > and the bit about validating on the command line is here
    >
    > http://www.garshol.priv.no/download/software/xmlproc/cmdline.html
    >
    > Is there any reason why it has to be in the base distribution?
    >


    Because I want to use it from a cgi script written in Python. And I
    am not allowed to install 3rd party stuff on the webserver. Even if I
    was it would not be a solution since it has to be easy to put it on
    another webserver. But of course: if there is a validating parser
    written completely in Python then I can use it too! If it runs under
    Python 2.1.1, that is (that is what they have at the website). I will
    investigate this www.garshol.priv.no link you gave me, thank you.
     
    Will Stuyvesant, Jul 3, 2003
    #4
  5. Alan Kennedy Guest

    Will Stuyvesant wrote:

    > Because I want to use it from a cgi script written in Python. And I
    > am not allowed to install 3rd party stuff on the webserver. Even if I
    > was it would not be a solution since it has to be easy to put it on
    > another webserver. But of course: if there is a validating parser
    > written completely in Python then I can use it too! If it runs under
    > Python 2.1.1, that is (that is what they have at the website). I will
    > investigate this www.garshol.priv.no link you gave me, thank you.


    Glad to be of help.

    There is a vaguely worrying comment on Lars's site, which says:

    "Note that it is recommended to use xmlproc through the SAX API rather
    than directly, since this provides much greater freedom in the choice
    of parsers. (For example, you can switch to using Pyexpat which is
    written in C without changing your code.)"

    Which seems to indicate to me that the author is encouraging the user
    not to rely on xmlproc too much. Perhaps performance might be an
    issue?
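
One consequence of going through the SAX API is that code can ask whichever parser it was given whether validation is available, instead of hard-coding a parser. The sketch below (Python 3, an assumption; supports_validation is a hypothetical helper name) tries to switch on the standard SAX validation feature and reports whether the parser accepted it. With only the standard library installed, make_parser() returns the expat driver, which is expected to refuse.

```python
import xml.sax
from xml.sax.handler import feature_validation


def supports_validation(parser):
    """Return True if the SAX parser agrees to turn validation on."""
    try:
        parser.setFeature(feature_validation, True)
    except (xml.sax.SAXNotSupportedException,
            xml.sax.SAXNotRecognizedException):
        return False
    return True


# Default parser from the standard library (the expat driver).
parser = xml.sax.make_parser()
print(supports_validation(parser))
```

The same function could be handed a validating SAX driver, if one is installed, without changing the calling code.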

    One more thing: There are alternative validation methods, which may or
    may not be suitable, depending on your requirements.

    For example, James Tauber's pytrex is a pure-Python implementation of
    James Clark's Tree Regular Expressions for XML (TREX) that uses the
    built-in C parser. I personally find TREX and pytrex a very natural,
    and thus easy to learn, way to check structures in a tree, including
    data validation. Pytrex is not complete, and is no longer maintained,
    but what's there is good code, with nice little features such as the
    ability to define your own datatype validation functions, which are
    called at match time.

    http://pytrex.sourceforge.net/

    Pytrex is unlikely to be ever completed, because James Clark has
    abandoned TREX in favour of RELAX-NG, for which I haven't seen any
    python implementation.

    http://www.relaxng.org/

    There is a python implementation of XML-Schema, xsv, written by Henry
    Thompson, which I think was kept fairly up-to-date with the XML-Schema
    spec as it evolved. However, given the complexity of XML-Schema, and
    having never tried to use xsv, I have no idea of its stability.

    http://www.ltg.ed.ac.uk/~ht/xsv-status.html

    I note that the author also maintains a web service for validating
    documents.

    Are you sure that XML validation-parsing is the right solution for
    your problem? There may be simpler ways.

    --
    alan kennedy
    -----------------------------------------------------
    check http headers here: http://xhaus.com/headers
    email alan: http://xhaus.com/mailto/alan
     
    Alan Kennedy, Jul 3, 2003
    #5
  6. > [Alan Kennedy]
    > ... interesting links and comments ...
    > Are you sure that XML validation-parsing is the right solution for
    > your problem? There may be simpler ways.


    We have defined a new XML vocabulary with a DTD. I offered to make a
    webservice so everybody can validate their XML files based on this
    DTD. For this I use CGI with Python 2.1.1 and I have no web master
    privileges.

    The idea of web applications is nice in that you do not have to code
    GUIs anymore: you can do pretty much everything with (X)HTML.
    Sometimes you have to rethink your UI so it is possible to give every
    user state a URI. A big plus is that everybody can now use your
    application. And you can do more than I thought before, for example
    users can send files from their computer with type=FILE fields in
    forms. And for development you can just download Apache, install
    it on your laptop, and configure it so that everything is exactly the
    same as on the target website (#!/usr/bin/python...means install their
    Python version in C:\usr\bin on your laptop :)

    The big problem with web applications is all the permissions you need
    to install, compile, configure, etc. For Python CGI this means you
    are stuck with some Python version and you realize how important the
    Python Standard Library is.

    --
    Experience is what allows you to recognize a mistake the second time
    you make it.
     
    Will Stuyvesant, Jul 3, 2003
    #6
  7. Asun Friere Guest

    Will Stuyvesant wrote:

    > Anyway, I can now do XML validation; below is
    > "validate.py". But this does not solve my initial
    > problem: if the file validates, validate.py prints
    > nothing, and if there is a mistake it prints an error
    > message. What I really wanted, to give more confidence
    > that the validation is okay, is to print 1 or 0
    > depending on the result, but I have not figured out yet
    > how to do that and now I am too tired of it all...


    This might do the trick:

    # file: validate.py
    import sys, pyRXP

    if len(sys.argv)<2 or sys.argv[1] in ['-h','--help','/?']:
        print 'Usage: validate.py xmlfilename'
        sys.exit()

    fn = open(sys.argv[1], 'r').read()
    try:
        pyRXP.Parser().parse(fn)
        print True
    except pyRXP.error:
        print False


    Though personally, rather than printing False, I would simply raise in
    the except clause, as the traceback provides the user with more
    information as to what is wrong with their xml.
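
A middle road between printing False and re-raising is to return both the 1-or-0 status and the parser's error message. The sketch below is written for Python 3 (an assumption) and takes the parse function and its exception class as parameters, so it runs here with the standard library's minidom as a stand-in; check is a hypothetical helper name. With pyRXP installed, the call would presumably be check(pyRXP.Parser().parse, text, pyRXP.error), using the names from the script above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def check(parse, text, error_type):
    """Return (1, "") if parse(text) succeeds, else (0, error message)."""
    try:
        parse(text)
    except error_type as exc:
        return 0, str(exc)
    return 1, ""


# Stand-in parser: minidom only checks well-formedness, not a DTD.
for sample in ("<a><b/></a>", "<a><b></a>"):
    status, message = check(minidom.parseString, sample, ExpatError)
    print(status, message)
```

This keeps the simple 1-or-0 answer for scripts while still passing the parser's own diagnostic on to the user.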
     
    Asun Friere, Jul 29, 2003
    #7