XML Validation with Python

  • Thread starter Will Stuyvesant

Will Stuyvesant

Can you give a command-line example of how to do XML validation
(checking against a DTD) with Python? Not with 4Suite or other 3rd
party libraries, just the Python standard distribution. I have Python
2.2 but can upgrade to 2.3 beta if needed.

I am looking for something like:

"
$ python validate.py myxmlfile.xml mydtd.dtd
"

where validate.py contains something like:

"
import somexmllib
import sys

# prints 1 if Okay :)
print somexmllib.validate(sys.argv[1], sys.argv[2])
"

I am sorry if this is a FAQ or if it is in one of the XML libraries;
I just could not figure it out!
 

Alan Kennedy

Will said:
Can you give a commandline example how to do XML Validation (checking
against a DTD) with Python? Not with 4Suite or other 3rd party
libraries, just the Python standard distribution.

You can't do it. The base distribution doesn't include a validating
XML parser.
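
To illustrate the distinction, here is a minimal sketch written for current Python 3. It uses the standard library's expat binding, which rejects a document that is not well-formed but does not enforce the content model declared in a DTD. The helper name `is_well_formed` and the sample document are made up for this example.

```python
import xml.parsers.expat


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses as well-formed XML.

    This checks syntax only. expat reads the DTD declarations
    but does not validate the document against them.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)  # True marks the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# The DTD says <note> must contain exactly one <to>; this document does not.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Will</from></note>"""

print(is_well_formed(DOC))                   # True: well-formed, although invalid
print(is_well_formed(b"<note><to></note>"))  # False: mismatched tag
```

So a standard-library-only script can catch syntax errors, but the DTD violation in the first document goes unnoticed.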

The only pure-Python validating parser is Lars Garshol's "xmlproc",
which is part of PyXML (a "third-party" optional extension). You can
read the documentation for xmlproc here:

http://www.garshol.priv.no/download/software/xmlproc/

and the part about validating on the command line is here:

http://www.garshol.priv.no/download/software/xmlproc/cmdline.html

Is there any reason why it has to be in the base distribution?

Assuming that you have a good reason, maybe you can tell us what
platform you're running on? There might be a platform-specific
parser/validator that you can call from Python.

HTH,
 

Will Stuyvesant

I could not find a way to write a simple command-line
XML validation utility using only the Python Standard
Library. I also found the xml.sax documentation
unclear, with no good examples to look at. The XML
examples in the Python Cookbook and in Python in a
Nutshell are BAD as well. Nowhere is the design of the
class library motivated (for example, "why do you need
a handler in xml.sax.parse(), and why is there no
default handler?"), and there are no simple examples of
how to use it. I like the approach taken by Fredrik
Lundh's Python Standard Library book MUCH more: clear
examples and explanations. It is a damn shame O'Reilly
does not want a new edition; the poor guy is now
putting a free version on his website.
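
On the handler question, a small sketch may help. It is written for current Python 3, and the class name `ElementCounter` and the sample document are made up for illustration. A SAX parser builds no tree; it reports events (element start, element end, character data) to a handler object, so without a handler there is nowhere for the results to go. The base class `xml.sax.handler.ContentHandler` can serve as a do-nothing default, because all of its methods ignore the events they receive.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Count how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag.
        self.counts[name] = self.counts.get(name, 0) + 1


DOC = b"<list><item>a</item><item>b</item></list>"

# A bare ContentHandler ignores every event: the parse succeeds
# (the document is well-formed) but nothing is recorded.
xml.sax.parseString(DOC, ContentHandler())

counter = ElementCounter()
xml.sax.parseString(DOC, counter)
print(counter.counts)  # {'list': 1, 'item': 2}
```

Subclassing and overriding only the callbacks you care about is the usual pattern; everything else falls through to the no-op methods of the base class.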

I have found a solution for XML validation using the
3rd party pyRXP library from http://www.reportlab.com/xml/pyrxp.html
Their "download and install" info is a mess. I first
downloaded a .ZIP containing only .DLL and .PYD files,
and it turned out you had to plunk those into
C:\Python22\DLL. This made me turn away from pyRXP
initially, because bad installation usually means bad
software. Later on I found a bigger .ZIP with more
stuff, so maybe I should have used that one? At least
it works now: I can do "import pyRXP". Make sure you
also download pyRXP_Documentation.pdf, which is good
documentation with examples. I notice the docs in the
other, bigger .ZIP are in .RML format...whatever that is!

I cannot believe the amount of bad documentation and
bad install approaches I see with 3rd party software.
That is why I normally stick to the Python Standard
Library only.

Anyway, I can now do XML validation; below is
"validate.py". But it does not solve my initial
problem: if the file validates, validate.py prints
nothing, and if there is a mistake it prints an error
message. What I really wanted, to give more confidence
that the validation is okay, is to print 1 or 0
depending on the result. I have not figured out yet
how to do that, and now I am too tired of it all...

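# file: validate.py
import sys

# Show usage when no file is given or help is requested.
if len(sys.argv) < 2 or sys.argv[1] in ['-h', '--help', '/?']:
    print 'Usage: validate.py xmlfilename'
    sys.exit()

import pyRXP

p = pyRXP.Parser()
# Read the whole document; a parse failure surfaces as an error message.
xml_text = open(sys.argv[1], 'r').read()
p.parse(xml_text)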
 

Will Stuyvesant

Alan Kennedy said:
The only pure python validating parser is Lars Garshol's "xmlproc",
which is a part of pyxml (a "third-party" optional extension). You can
read the documentation for xmlproc here

http://www.garshol.priv.no/download/software/xmlproc/

and the bit about validating on the command line is here

http://www.garshol.priv.no/download/software/xmlproc/cmdline.html

Is there any reason why it has to be in the base distribution?

Because I want to use it from a CGI script written in Python, and I
am not allowed to install 3rd party stuff on the web server. Even if I
were, it would not be a solution, since it has to be easy to move to
another web server. But of course, if there is a validating parser
written completely in Python then I can use that too! If it runs under
Python 2.1.1, that is (that is what they have on the web server). I
will investigate the www.garshol.priv.no link you gave me, thank you.
 

Alan Kennedy

Will said:
Because I want to use it from a cgi script written in Python. And I
am not allowed to install 3rd party stuff on the webserver. Even if I
was it would not be a solution since it has to be easy to put it on
another webserver. But of course: if there is a validating parser
written completely in Python then I can use it too! If it runs under
Python 2.1.1, that is (that is what they have at the website). I will
investigate this www.garshol.priv.no link you gave me, thank you.

Glad to be of help.

There is a vaguely worrying comment on Lars's site, which says:

"Note that it is recommended to use xmlproc through the SAX API rather
than directly, since this provides much greater freedom in the choice
of parsers. (For example, you can switch to using Pyexpat which is
written in C without changing your code.)"

This seems to indicate that the author is encouraging users not to
rely on xmlproc too much. Perhaps performance might be an issue?
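
What "through the SAX API" could look like is sketched below for current Python 3. The driver module name `xml.sax.drivers2.drv_xmlproc` is an assumption about how PyXML exposed xmlproc and is not verified here; it is not part of the standard library. The helper name `make_reader` is also made up. The point is only that the code asks SAX for a parser by name and then requests validation as a feature, so the same code works with whichever parser is installed.

```python
import xml.sax
from xml.sax.handler import feature_validation

# Assumed PyXML driver name; NOT part of the standard library.
XMLPROC_DRIVER = "xml.sax.drivers2.drv_xmlproc"


def make_reader(preferred=(XMLPROC_DRIVER,)):
    """Return (reader, validating).

    Drivers named in `preferred` are tried first; if none can be
    imported, make_parser() falls back to its default (expat).
    """
    reader = xml.sax.make_parser(list(preferred))
    try:
        reader.setFeature(feature_validation, True)
        validating = True
    except (xml.sax.SAXNotSupportedException,
            xml.sax.SAXNotRecognizedException):
        # The stdlib expat reader refuses to turn validation on.
        validating = False
    return reader, validating


reader, validating = make_reader()
print(type(reader).__module__, validating)
```

With only the standard library installed, this falls back to the expat reader and reports that validation is not available, which matches the earlier answer that the base distribution has no validating parser.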

One more thing: there are alternative validation methods, which may or
may not be suitable, depending on your requirements.

For example, there is pytrex, James Tauber's implementation of James
Clark's Tree Regular EXpressions (TREX). It is written in pure Python
and uses the built-in C parser. I personally find TREX and pytrex a
very natural, and thus easy to learn, way to check structures in a
tree, including data validation. Pytrex is not complete and is no
longer maintained, but what's there is good code, with nice little
features such as the ability to define your own datatype validation
functions, which are called at match time.

http://pytrex.sourceforge.net/

Pytrex is unlikely ever to be completed, because James Clark has
abandoned TREX in favour of RELAX-NG, for which I haven't seen any
Python implementation.

http://www.relaxng.org/

There is a Python implementation of XML-Schema, xsv, written by Henry
Thompson, which I think was kept fairly up-to-date with the XML-Schema
spec as it evolved. However, given the complexity of XML-Schema, and
having never tried to use xsv, I have no idea how stable it is.

http://www.ltg.ed.ac.uk/~ht/xsv-status.html

I note that the author also maintains a web service for validating
documents.

Are you sure that XML validation-parsing is the right solution for
your problem? There may be simpler ways.
 

Will Stuyvesant

[Alan Kennedy]
... interesting links and comments ...
Are you sure that XML validation-parsing is the right solution for
your problem? There may be simpler ways.

We have defined a new XML vocabulary with a DTD. I offered to make a
webservice so everybody can validate their XML files based on this
DTD. For this I use CGI with Python 2.1.1 and I have no web master
privileges.

The idea of web applications is nice in that you do not have to code
GUIs anymore: you can do pretty much everything with (X)HTML.
Sometimes you have to rethink your UI so it is possible to give every
user state a URI. A big plus is that everybody can now use your
application. And you can do more than I thought before, for example
users can send files from their computer with type=FILE fields in
forms. And for development you can just download Apache, install
it on your laptop, and configure it so that everything is exactly the
same as on the target website (#!/usr/bin/python... means installing
their Python version in C:\usr\bin on your laptop :)

The big problem with web applications is all the permissions you need
to install, compile, configure, etc. For Python CGI this means you
are stuck with some Python version and you realize how important the
Python Standard Library is.
 

Asun Friere

Anyway, I can now do XML validation, below is
"validate.py". But I am not solving my initial
problem: if it validates, then validate.py prints
nothing, if there is a mistake then it prints an error
message. What I really wanted; giving more confidence
that the validation is okay; is to print 1 or 0
depending on the result, but I have not figured out yet
how to do that and now I am too tired of it all...

This might do the trick:

# file: validate.py
import sys, pyRXP

if len(sys.argv) < 2 or sys.argv[1] in ['-h', '--help', '/?']:
    print 'Usage: validate.py xmlfilename'
    sys.exit()

# Read the whole document, then report whether it parses.
xml_text = open(sys.argv[1], 'r').read()
try:
    pyRXP.Parser().parse(xml_text)
    print True
except pyRXP.error:
    print False


Though personally, rather than printing False, I would simply raise in
the except clause, as the traceback provides the user with more
information as to what is wrong with their xml.
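
The two wishes (a plain 1 or 0, and the reason for a failure) can be combined. The following is a minimal sketch for current Python 3, not the pyRXP API: `check` and `expat_parse` are hypothetical helper names, and the standard library's expat is used as a stand-in parser so the example runs without third-party code. Note that expat checks well-formedness only, not validity against a DTD.

```python
import xml.parsers.expat
from xml.parsers.expat import ExpatError


def check(parse, text, errors=(Exception,)):
    """Run parse(text); return (1, "") on success, (0, reason) on failure.

    `parse` is any callable that raises one of `errors` when the
    document is rejected.
    """
    try:
        parse(text)
    except errors as exc:
        return 0, str(exc)
    return 1, ""


def expat_parse(text):
    # Stand-in parser: well-formedness only, no DTD validation.
    xml.parsers.expat.ParserCreate().Parse(text, True)


ok, reason = check(expat_parse, b"<a><b/></a>", (ExpatError,))
print(ok)          # 1

ok, reason = check(expat_parse, b"<a><b></a>", (ExpatError,))
print(ok, reason)  # 0, followed by the parser's error message
```

To use this with pyRXP, one would presumably pass `pyRXP.Parser().parse` as `parse` and `(pyRXP.error,)` as `errors`, following the script above. In a command-line tool, `sys.exit(0 if ok else 1)` would additionally expose the result as the process exit status.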
 
