lxml precaching DTD for document verification.

Gelonida N · Nov 27, 2011

Hi,

I'd like to verify some (x)html / html5 / xml documents from a server.

These documents have a very limited number of different doc types / DTDs.

So what I would like to do is build a small DTD cache and some code
that avoids fetching the same DTDs from the net over and over.

What would be the best way to do this?
I guess that the fields of an ElementTree that I have to look at are
docinfo.public_id
docinfo.system_url

There's also mention of a catalogue, but I don't know how to
use a catalogue, or how to find out what is inside my catalogue
and what isn't.
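
On the catalogue question: libxml2, which lxml is built on, can resolve DTD identifiers through OASIS XML catalog files listed in the XML_CATALOG_FILES environment variable. A catalog is itself a small XML file, so "what is inside my catalogue" can be answered by parsing it. The sketch below uses only the standard library; the helper names write_catalog and list_catalog are made up for illustration and are not part of lxml.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of the OASIS XML catalog format understood by libxml2.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(catalog_path, entries):
    """Write a catalog mapping DOCTYPE identifiers to local DTD files.

    entries: iterable of (public_id, system_id, local_dtd_path) tuples;
    public_id or system_id may be None. (Hypothetical helper.)
    """
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for public_id, system_id, local_path in entries:
        uri = Path(local_path).resolve().as_uri()
        if public_id:
            ET.SubElement(root, "{%s}public" % CATALOG_NS,
                          publicId=public_id, uri=uri)
        if system_id:
            ET.SubElement(root, "{%s}system" % CATALOG_NS,
                          systemId=system_id, uri=uri)
    ET.ElementTree(root).write(catalog_path, encoding="utf-8",
                               xml_declaration=True)


def list_catalog(catalog_path):
    """Return {identifier: local uri} for the public/system entries."""
    found = {}
    for elem in ET.parse(catalog_path).getroot():
        if elem.tag == "{%s}public" % CATALOG_NS:
            found[elem.get("publicId")] = elem.get("uri")
        elif elem.tag == "{%s}system" % CATALOG_NS:
            found[elem.get("systemId")] = elem.get("uri")
    return found
```

To use such a catalog, XML_CATALOG_FILES would point at the file (several files are separated by spaces). Setting the variable before the Python process starts is the cautious choice, since exactly when libxml2 reads it is an assumption worth testing on your own installation.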

Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way?

### functions with '???' are not implemented / are the ones
### where I don't know whether they exist already.

import os
import urllib

from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_url
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'?
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    # not cached yet: download the DTD once and keep it on disk
    dtd_filename = ???create_cache_filename(docinfo)
    (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)

def check_doc_cached(filename):
    """ function, which should report errors
        if a doc doesn't validate.
    """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print("validate error:")
        print(dtd.error_log.filter_from_errors()[0])
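
One way the '???' parts could be filled in, as a minimal Python 3 sketch rather than a definitive answer: keep parsed DTD objects in a dict, keep downloaded DTD files in a cache directory, and only go to the network when both miss. The names get_dtd, cache_filename and the fetch parameter are hypothetical. Two assumptions to be aware of: the document must carry a system identifier (an HTML5 `<!DOCTYPE html>` has none), and DTDs that pull in further files through relative identifiers, as the XHTML DTDs do with their .ent entity sets, would need those files stored next to the cached DTD as well.

```python
import hashlib
import os
import urllib.request

from lxml import etree

CACHE_DIR = os.path.join(os.path.expanduser("~"), ".my_dtd_cache")

# In-memory cache of parsed DTD objects, keyed by their on-disk path.
_parsed_dtds = {}


def cache_filename(system_url, cache_dir=CACHE_DIR):
    """Map a DTD URL to a stable file name inside the cache directory."""
    digest = hashlib.sha1(system_url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def get_dtd(docinfo, cache_dir=CACHE_DIR, fetch=urllib.request.urlretrieve):
    """Return an etree.DTD for the document, downloading it at most once.

    fetch(url, path) is injectable so tests can avoid the network.
    """
    system_url = docinfo.system_url
    if not system_url:
        raise ValueError("document has no DOCTYPE system identifier")
    path = cache_filename(system_url, cache_dir)
    dtd = _parsed_dtds.get(path)
    if dtd is None:
        if not os.path.exists(path):
            os.makedirs(cache_dir, exist_ok=True)
            fetch(system_url, path)   # only network access in this sketch
        dtd = etree.DTD(path)         # parse the local copy
        _parsed_dtds[path] = dtd
    return dtd
```

With this in place, check_doc_cached would call get_dtd(doc.docinfo) instead of get_from_cache(doc.docinfo); the validation and error reporting stay as in the skeleton.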

Roy Smith · Nov 27, 2011

Gelonida N said:
I'd like to verify some (x)html / html5 / xml documents from a server.

I'm sure you could roll your own validator with lxml and some DTDs, but
you would probably save yourself a huge amount of effort by just using
the validator the W3C provides (http://validator.w3.org/).

John Gordon · Nov 27, 2011

Roy Smith said:
I'm sure you could roll your own validator with lxml and some DTDs, but
you would probably save yourself a huge amount of effort by just using
the validator the W3C provides (http://validator.w3.org/).

With regards to XML, he may mean that he wants to validate that the
document conforms to a specific format, not just that it is generally
valid XML. I don't think the w3 validator will do that.

Gelonida N · Nov 28, 2011

This validator requires that I post the code to some host.
The content that I'd like to verify is intranet content, which I am
not allowed to post to an external site.

John Gordon said:
With regards to XML, he may mean that he wants to validate that the
document conforms to a specific format, not just that it is generally
valid XML. I don't think the w3 validator will do that.

Basically I want to integrate this into a django unit test.

I noticed that some of the templates generate documents with
mismatching DTD headers / contents.
All of the HTML code is parsable as XML (if it isn't, it's a bug).

There are also some custom XML files, which have their own specific DTDs.

So I thought about validating some of the generated html with lxml.

The Django test environment allows running test clients, which are
supposedly much faster than a real HTTP client.
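
Inside such a test, the body of a test-client response is available as bytes, so validation does not need a file on disk. A minimal sketch, assuming the DTD object has already been obtained somehow; validation_errors is a hypothetical helper name, and the Django lines in the comment are illustrative only:

```python
import io

from lxml import etree


def validation_errors(content, dtd):
    """Return DTD validation error messages for a response body (bytes).

    An empty list means the document validated. Markup that is not
    well-formed XML raises etree.XMLSyntaxError instead.
    """
    doc = etree.parse(io.BytesIO(content))
    if dtd.validate(doc):
        return []
    return ["line %d: %s" % (entry.line, entry.message)
            for entry in dtd.error_log.filter_from_errors()]


# Possible use in a Django TestCase method (URL is a placeholder):
#     response = self.client.get("/some/page/")
#     self.assertEqual(validation_errors(response.content, dtd), [])
```

Returning the messages rather than printing them lets the test runner show exactly which line of the generated page violated the DTD.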
