lxml precaching DTD for document verification.

Discussion in 'Python' started by Gelonida N, Nov 27, 2011.

  1. Gelonida N

    Gelonida N Guest

    Hi,

    I'd like to verify some (x)html / html5 / xml documents from a server.

    These documents have a very limited number of different doc types / DTDs.

    So I would like to build a small DTD cache, plus some code
    that avoids fetching the DTDs from the net over and over.

    What would be the best way to do this?
    I guess that
    the fields of an ElementTree that I have to look at are
    docinfo.public_id
    docinfo.system_url

    There's also mention of a catalogue, but I don't know how to
    use a catalogue, or how to find out what is inside my catalogue
    and what isn't.
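
One way to know what a catalogue contains is to write it yourself. libxml2, which lxml wraps, consults the OASIS XML catalog files named in the `XML_CATALOG_FILES` environment variable, so a small catalog can map each public ID or system URL to a local copy of the DTD. The following is a minimal sketch using only the standard library; the helper name `write_catalog` and the tuple layout of `entries` are assumptions made for illustration, not part of lxml.

```python
import os
import pathlib
import xml.etree.ElementTree as ET

# Namespace of OASIS XML catalog files.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(catalog_path, entries):
    """Write an XML catalog mapping DTD identifiers to local files.

    entries: iterable of (public_id, system_url, local_path) tuples;
    public_id or system_url may be None (hypothetical layout).
    """
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for public_id, system_url, local_path in entries:
        # Catalog targets are URIs, so turn the local path into file://...
        local_uri = pathlib.Path(os.path.abspath(local_path)).as_uri()
        if public_id:
            ET.SubElement(root, "{%s}public" % CATALOG_NS,
                          publicId=public_id, uri=local_uri)
        if system_url:
            ET.SubElement(root, "{%s}system" % CATALOG_NS,
                          systemId=system_url, uri=local_uri)
    ET.ElementTree(root).write(catalog_path, encoding="utf-8",
                               xml_declaration=True)
    return catalog_path
```

Setting `os.environ["XML_CATALOG_FILES"]` to the returned path should make libxml2 look there first; the variable most likely has to be set before lxml parses its first document, since the catalog list is read once.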


    Below is a non-working skeleton (first shot):
    ---------------------------------------------
    Would this be the right way??

    ### functions marked with '???' are not implemented / are the ones
    ### where I don't know whether they already exist.

    import os
    import urllib

    from lxml import etree

    cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

    def get_from_cache(docinfo):
        """ the function which I'd like to implement most efficiently """
        fpi = docinfo.public_id
        uri = docinfo.system_url
        dtd = ???get_from_dtd_cache(fpi, uri)
        if dtd is not None:
            return dtd
        # how can I check what is in my 'catalogue'
        if ???dtd_in_catalogue(??):
            return ???get_dtd_from_catalogue???
        # cache miss: download the DTD once and keep it on disk
        dtd_filename = ???create_cache_filename(docinfo)
        (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
        return etree.DTD(fname)


    def check_doc_cached(filename):
        """ function, which should report errors
        if a doc doesn't validate.
        """
        doc = etree.parse(filename)
        dtd = get_from_cache(doc.docinfo)
        rslt = dtd.validate(doc)
        if not rslt:
            print("validate error:")
            print(dtd.error_log.filter_from_errors()[0])
     
    Gelonida N, Nov 27, 2011
    #1
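
The missing cache pieces of the skeleton could look roughly like this. It is a sketch under stated assumptions: it targets Python 3 (where `urlretrieve` lives in `urllib.request`), the names `cache_filename` and `fetch_to_cache` are hypothetical, and the `download` parameter exists only so the network call can be swapped out. It caches a single DTD file per DOCTYPE and does not fetch entity files that a DTD may reference itself.

```python
import hashlib
import os
import urllib.request


def cache_filename(public_id, system_url, cache_dir):
    """Derive a stable cache path from the DOCTYPE identifiers."""
    key = "%s\n%s" % (public_id or "", system_url or "")
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def fetch_to_cache(public_id, system_url, cache_dir,
                   download=urllib.request.urlretrieve):
    """Return a local DTD path, downloading only on a cache miss."""
    path = cache_filename(public_id, system_url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        # download(url, filename) mirrors urlretrieve's first two arguments
        download(system_url, path)
    return path
```

With lxml available, the returned path can be handed to `etree.DTD(path)`, using `doc.docinfo.public_id` and `doc.docinfo.system_url` as the identifiers.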

  2. Roy Smith

    Roy Smith Guest

    In article <>,
    Gelonida N <> wrote:

    > I'd like to verify some (x)html / html5 / xml documents from a server.


    I'm sure you could roll your own validator with lxml and some DTDs, but
    you would probably save yourself a huge amount of effort by just using
    the validator the W3C provides (http://validator.w3.org/).
     
    Roy Smith, Nov 27, 2011
    #2

  3. John Gordon

    John Gordon Guest

    In <> Roy Smith <> writes:

    > In article <>,
    > Gelonida N <> wrote:
    >
    > > I'd like to verify some (x)html / html5 / xml documents from a server.


    > I'm sure you could roll your own validator with lxml and some DTDs, but
    > you would probably save yourself a huge amount of effort by just using
    > the validator the W3C provides (http://validator.w3.org/).


    With regard to XML, he may mean that he wants to validate that the
    document conforms to a specific format, not just that it is generally
    valid XML. I don't think the w3 validator will do that.

    --
    John Gordon A is for Amy, who fell down the stairs
    B is for Basil, assaulted by bears
    -- Edward Gorey, "The Gashlycrumb Tinies"
     
    John Gordon, Nov 27, 2011
    #3
  4. Gelonida N

    Gelonida N Guest

    On 11/27/2011 10:33 PM, John Gordon wrote:
    > In <> Roy Smith <> writes:
    >
    >> In article <>,
    >> Gelonida N <> wrote:
    >>
    >>> I'd like to verify some (x)html / html5 / xml documents from a server.

    >
    >> I'm sure you could roll your own validator with lxml and some DTDs, but
    >> you would probably save yourself a huge amount of effort by just using
    >> the validator the W3C provides (http://validator.w3.org/).


    This validator requires that I post the code to an external host.
    The content that I'd like to verify is intranet content, which I am
    not allowed to post to an external site.
    >
    > With regard to XML, he may mean that he wants to validate that the
    > document conforms to a specific format, not just that it is generally
    > valid XML. I don't think the w3 validator will do that.
    >



    Basically, I want to integrate this into a Django unit test.

    I noticed that some of the templates generate documents whose
    DTD headers do not match their contents.
    All of the HTML code is parsable as XML (if it isn't, it's a bug).

    There are also some custom XML files, which have their own specific DTDs.

    So I thought about validating some of the generated HTML with lxml.

    The Django test environment allows running test clients, which are
    supposedly much faster than a real HTTP client.
     
    Gelonida N, Nov 28, 2011
    #4
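
For validating generated markup inside a test without touching the network, one option is a custom lxml resolver that answers DTD lookups from local data. The sketch below assumes lxml is installed; `LocalDTDResolver`, `make_validating_parser`, and `validation_errors` are hypothetical names, and the mapping from public ID or system URL to DTD text is something the test setup would have to provide (for example, read from a cache directory). In a Django test, the `markup` argument would presumably be the body of a test-client response.

```python
from lxml import etree


class LocalDTDResolver(etree.Resolver):
    """Serve DTDs from an in-memory mapping instead of the network."""

    def __init__(self, dtd_by_id):
        super().__init__()
        # Keys may be public IDs or system URLs; values are DTD text.
        self.dtd_by_id = dtd_by_id

    def resolve(self, system_url, public_id, context):
        dtd_text = (self.dtd_by_id.get(public_id)
                    or self.dtd_by_id.get(system_url))
        if dtd_text is None:
            return None  # unknown DTD: fall back to lxml's default lookup
        return self.resolve_string(dtd_text, context)


def make_validating_parser(dtd_by_id):
    """Build a parser that validates against locally resolved DTDs."""
    parser = etree.XMLParser(dtd_validation=True, no_network=True)
    parser.resolvers.add(LocalDTDResolver(dtd_by_id))
    return parser


def validation_errors(markup, parser):
    """Return a list of error messages, empty if the markup validates."""
    try:
        etree.fromstring(markup, parser)
    except etree.XMLSyntaxError as exc:
        return [str(exc)]
    return []
```

A test could then assert that `validation_errors(markup, parser)` is an empty list and print the messages otherwise.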